More work-in-progress shots.
Thursday, October 22, 2009
Thursday, October 15, 2009
BitMasks / Packing Data into fp Render Targets
Recently I had the need to pack bit fields into 32-bit channels of a 32:32:32:32 fp render target.
First of all we can assume that all registers in the pixel shader operate in 32-bit precision and output data is written into a 32-bit fp render target. The 32-bit (or single-precision) floating point format uses 1 sign, 8-bits of exponent, and 23 bits of mantissa following the IEEE 754 standard.
To maintain maximum precision during floating-point computations, most computations use normalized values. Keeping floating-point numbers normalized is beneficial because it maintains the maximum number of bits of precision in a computation. If several higher-order bits of the mantissa are all zero, the mantissa has that many fewer bits of precision available for computation. Therefore a floating-point computation will be more accurate if it involves only normalized values whose higher-order mantissa bit contains one.
The IEEE 754 32-bit floating-point format specifies special cases in case the bits in the exponent are all set to zeros or ones. If all exponent bits are set, then the number represents either =/- infinity or a NaN (not-a-number), depending on the mantissa value. If all exponent bits are zero, then the number is denormalized and automatically gets flushed to zero as specified in the Direct3D 10 single-precision floating-point specifications (see Nicolas Thibieroz, "Packing Arbitrary Bit Fields into 16-bit Floating-Point Render Targets in DirectX10", ShaderX7).
When packing bit values, those cases need to be avoided.
First of all we can assume that all registers in the pixel shader operate in 32-bit precision and output data is written into a 32-bit fp render target. The 32-bit (or single-precision) floating point format uses 1 sign, 8-bits of exponent, and 23 bits of mantissa following the IEEE 754 standard.
To maintain maximum precision during floating-point computations, most computations use normalized values. Keeping floating-point numbers normalized is beneficial because it maintains the maximum number of bits of precision in a computation. If several higher-order bits of the mantissa are all zero, the mantissa has that many fewer bits of precision available for computation. Therefore a floating-point computation will be more accurate if it involves only normalized values whose higher-order mantissa bit contains one.
The IEEE 754 32-bit floating-point format specifies special cases in case the bits in the exponent are all set to zeros or ones. If all exponent bits are set, then the number represents either =/- infinity or a NaN (not-a-number), depending on the mantissa value. If all exponent bits are zero, then the number is denormalized and automatically gets flushed to zero as specified in the Direct3D 10 single-precision floating-point specifications (see Nicolas Thibieroz, "Packing Arbitrary Bit Fields into 16-bit Floating-Point Render Targets in DirectX10", ShaderX7).
When packing bit values, those cases need to be avoided.
// Pack three positive normalized numbers between 0.0 and 1.0 into a 32-bit fp
// channel of a render target
float Pack3PNForFP32(float3 channel)
{
// layout of a 32-bit fp register
// SEEEEEEEEMMMMMMMMMMMMMMMMMMMMMMM
// 1 sign bit; 8 bits for the exponent and 23 bits for the mantissa
uint uValue;
// pack x
uValue = ((uint)(channel.x * 65535.0 + 0.5)); // goes from bit 0 to 15
// pack y in EMMMMMMM
uValue |= ((uint)(channel.y * 255.0 + 0.5)) << 16
// pack z in SEEEEEEE
// the last E will never be 1b because the upper value is 254
// max value is 11111110 == 254
// this prevents the bits of the exponents to become all 1
// range is 1.. 254
// to prevent an exponent that is 0 we add 1.0
uValue |= ((uint)(channel.z * 253.0 + 1.5)) << 24
return asfloat(uValue);
}
// unpack three positive normalized values from a 32-bit float
float3 Unpack3PNFromFP32(float fFloatFromFP32)
{
float a, b, c, d;
uint uValue;
uint uInputFloat = asuint(fFloatFromFP32);
// unpack a
// mask out all the stuff above 16-bit with 0xFFFF
a = ((uInputFloat) & 0xFFFF) / 65535.0;
b = ((uInputFloat >> 16) & 0xFF) / 255.0;
// extract the 1..254 value range and subtract 1
// ending up with 0..253
c = (((uInputFloat >> 24) & 0xFF) - 1.0) / 253.0;
return float3(a, b, c);
}