First of all, we can assume that all registers in the pixel shader operate in 32-bit precision and that output data is written into a 32-bit floating-point render target. The 32-bit (single-precision) floating-point format uses 1 sign bit, 8 bits of exponent, and 23 bits of mantissa, following the IEEE 754 standard.
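To make the field layout concrete, here is a minimal CPU-side sketch in C that splits a float into the three fields. It is not shader code: `Fp32Fields` and `fp32_split` are hypothetical names, `memcpy` stands in for HLSL's `asuint()`, and the sketch assumes the platform's `float` is a 32-bit IEEE 754 value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The three fields of an IEEE 754 single-precision value. */
typedef struct {
    uint32_t sign;      /* 1 bit,   bit 31      */
    uint32_t exponent;  /* 8 bits,  bits 23..30 */
    uint32_t mantissa;  /* 23 bits, bits 0..22  */
} Fp32Fields;

/* Split a float into its fields; memcpy plays the role of HLSL's asuint(). */
Fp32Fields fp32_split(float value)
{
    uint32_t bits;
    Fp32Fields fields;

    /* Assumption: float is a 32-bit IEEE 754 type on this platform. */
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &value, sizeof bits);

    fields.sign     = bits >> 31;
    fields.exponent = (bits >> 23) & 0xFFu;
    fields.mantissa = bits & 0x7FFFFFu;
    return fields;
}
```

For example, `fp32_split(1.0f)` yields sign 0, exponent 127 (the exponent bias) and mantissa 0.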

To maintain maximum precision, most floating-point computations use normalized values, because a normalized value makes every bit of the mantissa available to the computation. If several higher-order bits of the mantissa are zero, as in a denormalized number, the mantissa has that many fewer bits of precision available. A floating-point computation will therefore be more accurate if it involves only normalized values, whose leading significand bit is one (in IEEE 754 this leading bit is implicit and not stored).
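The loss of precision can be illustrated by counting how many significant bits a given bit pattern carries. The following C sketch is only an illustration under the usual IEEE 754 single-precision layout; `fp32_significant_bits` is a hypothetical helper name.

```c
#include <stdint.h>

/* Number of significant bits carried by a 32-bit float bit pattern.
   Normalized values have an implicit leading 1 plus 23 stored bits.
   Denormalized values only have the bits from the highest set mantissa
   bit downwards. Zero, infinity and NaN are reported as 0. */
int fp32_significant_bits(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t mantissa = bits & 0x7FFFFFu;
    int count = 0;

    if (exponent == 0xFFu)
        return 0;   /* infinity or NaN: not a finite number */
    if (exponent != 0u)
        return 24;  /* normalized: implicit 1 + 23 mantissa bits */

    /* Denormalized (or zero): leading zero bits carry no precision. */
    while (mantissa != 0u) {
        ++count;
        mantissa >>= 1;
    }
    return count;
}
```

With this helper, the pattern `0x3F800000` (1.0) reports 24 bits, while the smallest denormal `0x00000001` reports a single bit.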

The IEEE 754 32-bit floating-point format defines special cases for exponents whose bits are all zeros or all ones. If all exponent bits are set, the number represents either +/- infinity or a NaN (not-a-number), depending on the mantissa value. If all exponent bits are zero, the number is either zero (mantissa also zero) or denormalized; a denormalized number is automatically flushed to zero as specified in the Direct3D 10 single-precision floating-point specifications (see Nicolas Thibieroz, "Packing Arbitrary Bit Fields into 16-bit Floating-Point Render Targets in DirectX10", ShaderX7).

When packing bit values, those cases need to be avoided.
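The special cases described above can be told apart from the exponent and mantissa fields alone. Here is a small C sketch of such a classification; the enum and function names are hypothetical, and the check simply mirrors the IEEE 754 rules stated in the text.

```c
#include <stdint.h>

typedef enum {
    FP32_ZERO,
    FP32_DENORMAL,
    FP32_NORMAL,
    FP32_INFINITY,
    FP32_NAN
} Fp32Class;

/* Classify a 32-bit pattern by its exponent (bits 23..30) and mantissa. */
Fp32Class fp32_classify(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t mantissa = bits & 0x7FFFFFu;

    if (exponent == 0xFFu)                 /* all exponent bits set */
        return mantissa ? FP32_NAN : FP32_INFINITY;
    if (exponent == 0u)                    /* all exponent bits clear */
        return mantissa ? FP32_DENORMAL : FP32_ZERO;
    return FP32_NORMAL;
}

/* A packed bit field survives the render target only if it is a normal number. */
int fp32_safe_for_packing(uint32_t bits)
{
    return fp32_classify(bits) == FP32_NORMAL;
}
```

A packing routine could be validated offline by running every bit pattern it can produce through `fp32_safe_for_packing`.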

    // Pack three positive normalized numbers between 0.0 and 1.0 into a 32-bit fp
    // channel of a render target
    float Pack3PNForFP32(float3 channel)
    {
        // layout of a 32-bit fp register
        // SEEEEEEEEMMMMMMMMMMMMMMMMMMMMMMM
        // 1 sign bit, 8 bits for the exponent and 23 bits for the mantissa
        uint uValue;

        // pack x into bits 0..15
        uValue = ((uint)(channel.x * 65535.0 + 0.5));

        // pack y into bits 16..23 (EMMMMMMM)
        uValue |= ((uint)(channel.y * 255.0 + 0.5)) << 16;

        // pack z into bits 24..31 (SEEEEEEE)
        // the stored range is 1..254: adding 1.0 keeps the byte from being
        // 00000000b, and the upper limit of 254 (11111110b) keeps it from
        // being 11111111b
        // caution: the exponent is bits 23..30, i.e. the lower seven bits of
        // this byte plus the top bit of y, so the byte values 127 and 128 can
        // still produce an all-1 or all-0 exponent depending on y
        uValue |= ((uint)(channel.z * 253.0 + 1.5)) << 24;

        return asfloat(uValue);
    }

    // unpack three positive normalized values from a 32-bit float
    float3 Unpack3PNFromFP32(float fFloatFromFP32)
    {
        float a, b, c;
        uint uInputFloat = asuint(fFloatFromFP32);

        // unpack a: mask out everything above bit 15 with 0xFFFF
        a = ((uInputFloat) & 0xFFFF) / 65535.0;

        // unpack b: bits 16..23
        b = ((uInputFloat >> 16) & 0xFF) / 255.0;

        // unpack c: extract the 1..254 value range from bits 24..31 and
        // subtract 1, ending up with 0..253
        c = (((uInputFloat >> 24) & 0xFF) - 1.0) / 253.0;

        return float3(a, b, c);
    }
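Because the 8-bit exponent (bits 23..30) straddles the y and z fields in this layout, it can be worth checking on the CPU which packed codes end up with an exponent of 0 or 255. The following C sketch is a hedged port of the arithmetic above, not the shader itself: the function names are hypothetical, the packed word is kept as a `uint32_t` rather than reinterpreted as a float, and inputs are assumed to lie in [0, 1].

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as the shader: x in bits 0..15, y in 16..23, z in 24..31. */
uint32_t pack3_bits(float x, float y, float z)
{
    uint32_t bits;

    assert(x >= 0.0f && x <= 1.0f);
    assert(y >= 0.0f && y <= 1.0f);
    assert(z >= 0.0f && z <= 1.0f);

    bits  = (uint32_t)(x * 65535.0f + 0.5f);
    bits |= (uint32_t)(y * 255.0f + 0.5f) << 16;
    bits |= (uint32_t)(z * 253.0f + 1.5f) << 24;  /* byte range 1..254 */
    return bits;
}

/* Inverse of pack3_bits; out receives x, y, z. */
void unpack3_bits(uint32_t bits, float out[3])
{
    out[0] = (float)(bits & 0xFFFFu) / 65535.0f;
    out[1] = (float)((bits >> 16) & 0xFFu) / 255.0f;
    out[2] = ((float)((bits >> 24) & 0xFFu) - 1.0f) / 253.0f;
}

/* Exponent field (bits 23..30) of a packed word. */
uint32_t exponent_field(uint32_t bits)
{
    return (bits >> 23) & 0xFFu;
}

/* Count the (y code, top-byte code) pairs whose exponent field is 0 or 255,
   for top-byte codes in [z_lo, z_hi] written to bits 24..31. */
unsigned count_special_exponents(uint32_t z_lo, uint32_t z_hi)
{
    unsigned count = 0;

    for (uint32_t zc = z_lo; zc <= z_hi; ++zc) {
        for (uint32_t yc = 0; yc <= 255u; ++yc) {
            uint32_t e = exponent_field((zc << 24) | (yc << 16));
            if (e == 0u || e == 0xFFu)
                ++count;
        }
    }
    return count;
}
```

For the 1..254 byte range, `count_special_exponents(1, 254)` reports 256 of the 65,024 code pairs: top byte 127 combined with y codes of 128 and above, and top byte 128 combined with y codes below 128 (for instance z = 0.5 with y = 0.25). Restricting the lower seven bits of the top byte to 1..126, as in `count_special_exponents(1, 126)`, leaves no such pairs, at the cost of z precision.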

## 6 comments:

How many bits do your original 3 values have to begin with?

From the looks of your algo, the X channel would be 16b, Y would be 8b, but Z is 8b too, but with limited range [0, 254].

Basically, to avoid denorms+NaNs, you could say you really only have 3*10b available in a 32b chunk, which could be preferable in some cases.

And if you only need 3*8b, then they can all fit in the mantissa in a normalized number (just add 2^24 and you're pretty much done).

How do I pack r8g8b8a6 in f32?

Well, the first three channels are easy. You would need to make sure that the exponent bits never become all 0s or all 1s.

One way to do this is always add 1 and subtract it later.

I would think a good strategy is to check if the exponent is zero and then set bit 30? I am sure someone has a better idea.

Another way of doing this from a post I did in 2005:

http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=265070

It is aimed more at packing offline and then unpacking in the shader in 3 float instructions.

I have used it personally for packing blend shape vertex streams.

Great article. Really interesting how you had to work around FP exponent special cases.
