Diary of a Graphics Programmer: October 2009

Recently I had the need to pack bit fields into 32-bit channels of a 32:32:32:32 fp render target.
First of all, we can assume that all registers in the pixel shader operate in 32-bit precision and that output data is written into a 32-bit fp render target. The 32-bit (or single-precision) floating-point format follows the IEEE 754 standard and uses 1 sign bit, 8 bits of exponent, and 23 bits of mantissa.
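
To make that layout concrete, here is a minimal HLSL sketch that splits a float into its three fields. The helper name DecodeFP32Fields is made up for illustration and is not part of the packing code; it assumes Shader Model 4 or later, where asuint() and integer bit operations are available.

```hlsl
// Hypothetical helper (illustration only): splits a 32-bit float into its
// IEEE 754 fields.
// result.x = sign (0 or 1)
// result.y = biased exponent (0..255)
// result.z = mantissa (23 bits)
uint3 DecodeFP32Fields(float fValue)
{
    uint uBits = asuint(fValue);

    uint uSign     = uBits >> 31;           // bit 31
    uint uExponent = (uBits >> 23) & 0xFF;  // bits 23..30
    uint uMantissa = uBits & 0x007FFFFF;    // bits 0..22

    return uint3(uSign, uExponent, uMantissa);
}
```

For example, 1.0 has the bit pattern 0x3F800000, so it decodes to sign 0, exponent 127 and mantissa 0.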

Most floating-point computations use normalized values, because this keeps the maximum number of bits of precision available. If several high-order bits of the mantissa are zero, the mantissa has that many fewer bits of precision available for computation. A floating-point computation is therefore more accurate if it involves only normalized values, whose highest-order mantissa bit is one (in IEEE 754 this leading one is implicit and is not stored in the 23 mantissa bits).

The IEEE 754 32-bit floating-point format treats exponents whose bits are all zeros or all ones as special cases. If all exponent bits are set, the number represents either +/- infinity (mantissa is zero) or a NaN (not-a-number, mantissa is non-zero). If all exponent bits are zero, the number is either zero (mantissa is zero) or denormalized, and a denormalized value automatically gets flushed to zero as specified in the Direct3D 10 single-precision floating-point specifications (see Nicolas Thibieroz, "Packing Arbitrary Bit Fields into 16-bit Floating-Point Render Targets in DirectX10", ShaderX7).

When packing bit values, those cases need to be avoided.
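
A quick way to check a packed bit pattern against those cases is to look at the exponent field directly. The following is only a sketch; the helper name IsSafeFP32Pattern is hypothetical and it again assumes Shader Model 4 or later.

```hlsl
// Hypothetical helper (illustration only): returns true when the exponent
// field of a bit pattern is neither all zeros (zero or denormal) nor all
// ones (infinity or NaN).
bool IsSafeFP32Pattern(uint uBits)
{
    uint uExponent = (uBits >> 23) & 0xFF;  // bits 23..30

    return (uExponent != 0) && (uExponent != 0xFF);
}
```

With this check, 0x3F800000 (1.0) counts as safe, while 0x7F800000 (+infinity) and 0x00000001 (the smallest denormal) do not. Note that the exponent field sits in bits 23..30, so it straddles the byte boundary at bit 24.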

// Pack three positive normalized numbers between 0.0 and 1.0 into a 32-bit fp
// channel of a render target
float Pack3PNForFP32(float3 channel)
{
    // layout of a 32-bit fp register
    // SEEEEEEEEMMMMMMMMMMMMMMMMMMMMMMM
    // 1 sign bit, 8 bits for the exponent and 23 bits for the mantissa
    uint uValue;

    // pack x into the low 16 mantissa bits (bits 0 to 15)
    uValue = ((uint)(channel.x * 65535.0 + 0.5));

    // pack y in EMMMMMMM (bits 16 to 23: the upper 7 mantissa bits and the
    // lowest exponent bit)
    uValue |= ((uint)(channel.y * 255.0 + 0.5)) << 16;

    // pack z in SEEEEEEE (bits 24 to 31: the upper 7 exponent bits and the
    // sign bit)
    // channel.z * 253.0 covers 0..253; adding 1.0 moves that to 1..254 and
    // the extra 0.5 rounds to the nearest integer
    // range is 1..254 (00000001b..11111110b), so this byte is never all
    // zeros or all ones
    // caveat: the exponent is bits 23 to 30, so it also contains the top bit
    // of y; a z byte of 127 with y >= 128 still gives an all-ones exponent,
    // and a z byte of 128 with y < 128 still gives an all-zero exponent
    uValue |= ((uint)(channel.z * 253.0 + 1.5)) << 24;

    return asfloat(uValue);
}

// unpack three positive normalized values from a 32-bit float
float3 Unpack3PNFromFP32(float fFloatFromFP32)
{
    float a, b, c;
    uint uInputFloat = asuint(fFloatFromFP32);

    // unpack a
    // mask out everything above the low 16 bits with 0xFFFF
    a = ((uInputFloat) & 0xFFFF) / 65535.0;

    // unpack b from bits 16 to 23
    b = ((uInputFloat >> 16) & 0xFF) / 255.0;

    // unpack c from bits 24 to 31
    // extract the 1..254 value range and subtract 1,
    // ending up with 0..253
    c = (((uInputFloat >> 24) & 0xFF) - 1.0) / 253.0;

    return float3(a, b, c);
}
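
As a usage sketch, the two functions can be chained to see how much precision each component keeps. The helper name RoundTripError3PN is hypothetical, and the function only exercises the bit casts inside the shader; it does not model what happens when the value is actually written to and read back from a render target.

```hlsl
// Hypothetical helper (illustration only): packs and immediately unpacks a
// value, returning the absolute error per component.
float3 RoundTripError3PN(float3 vInput)
{
    // the packing code expects values between 0.0 and 1.0
    float3 vClamped = saturate(vInput);

    float  fPacked   = Pack3PNForFP32(vClamped);
    float3 vUnpacked = Unpack3PNFromFP32(fPacked);

    return abs(vUnpacked - vClamped);
}
```

Since x is stored in 16 bits and y and z in 8 bits each, the error should be roughly half a quantization step per component: about 0.5 / 65535 for x, 0.5 / 255 for y and 0.5 / 253 for z.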

Thursday, October 15, 2009

BitMasks / Packing Data into fp Render Targets