First of all we can assume that all registers in the pixel shader operate in 32-bit precision and output data is written into a 32-bit fp render target. The 32-bit (or single-precision) floating point format uses 1 sign, 8-bits of exponent, and 23 bits of mantissa following the IEEE 754 standard.
To maintain maximum precision during floating-point computations, most computations use normalized values. Keeping floating-point numbers normalized is beneficial because it maintains the maximum number of bits of precision in a computation. If several higher-order bits of the mantissa are all zero, the mantissa has that many fewer bits of precision available for computation. Therefore a floating-point computation will be more accurate if it involves only normalized values whose higher-order mantissa bit contains one.
The IEEE 754 32-bit floating-point format specifies special cases in case the bits in the exponent are all set to zeros or ones. If all exponent bits are set, then the number represents either =/- infinity or a NaN (not-a-number), depending on the mantissa value. If all exponent bits are zero, then the number is denormalized and automatically gets flushed to zero as specified in the Direct3D 10 single-precision floating-point specifications (see Nicolas Thibieroz, "Packing Arbitrary Bit Fields into 16-bit Floating-Point Render Targets in DirectX10", ShaderX7).
When packing bit values, those cases need to be avoided.
// Pack three positive normalized numbers between 0.0 and 1.0 into a 32-bit fp
// channel of a render target
float Pack3PNForFP32(float3 channel)
// layout of a 32-bit fp register
// 1 sign bit; 8 bits for the exponent and 23 bits for the mantissa
// pack x
uValue = ((uint)(channel.x * 65535.0 + 0.5)); // goes from bit 0 to 15
// pack y in EMMMMMMM
uValue |= ((uint)(channel.y * 255.0 + 0.5)) << 16
// pack z in SEEEEEEE
// the last E will never be 1b because the upper value is 254
// max value is 11111110 == 254
// this prevents the bits of the exponents to become all 1
// range is 1.. 254
// to prevent an exponent that is 0 we add 1.0
uValue |= ((uint)(channel.z * 253.0 + 1.5)) << 24
// unpack three positive normalized values from a 32-bit float
float3 Unpack3PNFromFP32(float fFloatFromFP32)
float a, b, c, d;
uint uInputFloat = asuint(fFloatFromFP32);
// unpack a
// mask out all the stuff above 16-bit with 0xFFFF
a = ((uInputFloat) & 0xFFFF) / 65535.0;
b = ((uInputFloat >> 16) & 0xFF) / 255.0;
// extract the 1..254 value range and subtract 1
// ending up with 0..253
c = (((uInputFloat >> 24) & 0xFF) - 1.0) / 253.0;
return float3(a, b, c);