Thursday, October 15, 2009

BitMasks / Packing Data into fp Render Targets

Recently I had the need to pack bit fields into 32-bit channels of a 32:32:32:32 fp render target.
First of all we can assume that all registers in the pixel shader operate in 32-bit precision and output data is written into a 32-bit fp render target. The 32-bit (or single-precision) floating point format uses 1 sign, 8-bits of exponent, and 23 bits of mantissa following the IEEE 754 standard.

To maintain maximum precision during floating-point computations, most computations use normalized values. Keeping floating-point numbers normalized is beneficial because it maintains the maximum number of bits of precision in a computation. If several higher-order bits of the mantissa are all zero, the mantissa has that many fewer bits of precision available for computation. Therefore a floating-point computation will be more accurate if it involves only normalized values whose higher-order mantissa bit contains one.

The IEEE 754 32-bit floating-point format specifies special cases in case the bits in the exponent are all set to zeros or ones. If all exponent bits are set, then the number represents either =/- infinity or a NaN (not-a-number), depending on the mantissa value. If all exponent bits are zero, then the number is denormalized and automatically gets flushed to zero as specified in the Direct3D 10 single-precision floating-point specifications (see Nicolas Thibieroz, "Packing Arbitrary Bit Fields into 16-bit Floating-Point Render Targets in DirectX10", ShaderX7).

When packing bit values, those cases need to be avoided.

// Pack three positive normalized numbers between 0.0 and 1.0 into a 32-bit fp
// channel of a render target
float Pack3PNForFP32(float3 channel)
// layout of a 32-bit fp register
// 1 sign bit; 8 bits for the exponent and 23 bits for the mantissa
uint uValue;

// pack x
uValue = ((uint)(channel.x * 65535.0 + 0.5)); // goes from bit 0 to 15
// pack y in EMMMMMMM
uValue |= ((uint)(channel.y * 255.0 + 0.5)) << 16

// pack z in SEEEEEEE
// the last E will never be 1b because the upper value is 254
// max value is 11111110 == 254
// this prevents the bits of the exponents to become all 1
// range is 1.. 254
// to prevent an exponent that is 0 we add 1.0
uValue |= ((uint)(channel.z * 253.0 + 1.5)) << 24

return asfloat(uValue);

// unpack three positive normalized values from a 32-bit float
float3 Unpack3PNFromFP32(float fFloatFromFP32)
float a, b, c, d;
uint uValue;
uint uInputFloat = asuint(fFloatFromFP32);
// unpack a
// mask out all the stuff above 16-bit with 0xFFFF
a = ((uInputFloat) & 0xFFFF) / 65535.0;
b = ((uInputFloat >> 16) & 0xFF) / 255.0;
// extract the 1..254 value range and subtract 1
// ending up with 0..253
c = (((uInputFloat >> 24) & 0xFF) - 1.0) / 253.0;

return float3(a, b, c);


Anonymous said...

How many bits do your original 3 values have to begin with?

From the looks of your algo, the X channel would be 16b, Y would be 8b, but Z is 8b too, but with limited range [0, 254].

Basically, to avoid denorms+NaNs, you could say you really only have 3*10b available in a 32b chunk, which could be preferable in some case.

And if you only need 3*8b, then they can all fit in the mantissa in a normalized number (just add 2^24 and you're pretty much done).

Anonymous said...

How do I pack r8g8b8a6 in f32?

Wolfgang Engel said...

well the first three channels are easy. You would need to make sure that the whole exponent doesn't go to 0 or 1.
One way to do this is always add 1 and subtract it later.

Wolfgang Engel said...

I would think a good strategy is to check if the exponent is zero and then set bit 30? I am sure someone has a better idea.

Unknown said...

Another way of doing this from a post I did in 2005:

More for packing offline packing then unpacking in the shader in 3 float instructions.

I have used it personally for packing blend shape vertex streams.

MikeR said...

Great article. Really interesting how you had to work around FP exponent special cases.