Wednesday, March 26, 2014

Compute Shader Optimizations for AMD GPUs: Parallel Reduction

We recently looked more often into compute shader optimizations on AMD platforms. Additionally I had a UCSD class in Winter that dealt with this topic and a talk at the Sony booth at GDC 2014 that covered the same topic.
This blog post covers a common scenario while implementing a post-processing pipeline: Parallel Reduction. It uses the excellent talk given by Mark Harris a few years back as a starting point, enriched with new discoveries, credited to the new hardware platforms and AMD specifics.

The topics covered are:
  • Sequential Shared Memory (TGSM) Access: utilizing the Memory bank layout 
  • When to Unroll Loops in a compute shader 
    • Overhead of address arithmetic and loop instructions 
    • Skipping Memory Barriers: Wavefront 
  • Pre-fetching data into Shared Memory 
Most of the examples accompanying this blog post are showing a simple parallel reduction going from 1080p to 120x68. While reducing the size of the image, these examples also reduce the color color value to luminance.

Image 1 - Going from 1080p to 120x68 and from Color to Luminance

On an algorithmic level, Parallel Reduction looks like a tree-based approach:

Image 2 - Tree-based approach for Parallel Reduction

Instead of building a fully recursive kernel, which is not possible on current hardware, the algorithm mimics recursion by using a for loop.
As you will see later on, the fact that each of the invocations utilizes less threads from a pool of threads in a thread group, has some impact on performance. Let's say we allocate 256 threads in a thread pool, only the first iteration of the Parallel Reduction algorithm will use all of them. The second iteration -based on the implementation- might only use half of them and the next one again half of those etc..

TGSM Access: Utilizing the Memory bank layout
One of the first rules of thumb mentioned by Nicolas Thibieroz is dealing with the access pattern that is used to access TGSM. There is only a limited number of I/O banks and they need to be utilized in the most efficient way. It turns out that AMD and NVIDIA seem to have both 32 banks.

Image 3 - Memory banks are arranged linearly with addresses

Accessing TGSM with addresses that are 32 DWORD apart will lead to a situation where threads will use the same bank. This generates so called bank conflicts. In other words, accessing the same address from multiple threads creates bank conflicts.
The preferred method to access TGSM is to have 32 threads use 32 different banks. Usually the extreme example mentioned is the 2D array example, where you want to access memory by increasing the bank number first -you might consider this moving horizontally- and then increase the vertical direction. This way threads will hit different banks more often.
The more subtle bank conflicts happen when memory banks are accessed in non-sequential patterns. Mark Harris has shown the following example. Here is an image depicting this:

Image 4 - Memory banks accessed interleaved

The first example source is showing an implementation of this memory access pattern:

// Example for Interleaved Memory access
[numthreads(THREADX, THREADY, 1)]
void PostFX( uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID, uint GI : SV_GroupIndex  )
  const float4 LumVector = float4(0.2125f, 0.7154f, 0.0721f, 0.0f);
 uint idx = DTid.x + DTid.y * c_width; // read from structured buffer
 sharedMem[GI] = dot(Input[idx], LumVector); // store in shared memory   
 GroupMemoryBarrierWithGroupSync(); // wait until everything is transfered from device memory to shared memory

 for (uint s = 1; s < groupthreads; s *= 2)  // stride: 1, 2, 4, 8, 16, 32, 64, 128
   int index = 2 * s * GI;

(index < (groupthreads))
     sharedMem[index] += sharedMem[index + s];
 // Have the first thread write out to the output
 if (GI == 0)
  // write out the result for each thread group
  Result[Gid.xy] = sharedMem[0] / (THREADX * THREADY);

This code fetches TGSM in its for loop in a pattern as showed in Image 4. The image of a sequential access pattern is supposed to look like this:
Image 5 - Memory banks accessed sequential

The source code of the sequential access version looks like this:

[numthreads(THREADX, THREADY, 1)]
void PostFX( uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID, uint GI : SV_GroupIndex  )
  const float4 LumVector = float4(0.2125f, 0.7154f, 0.0721f, 0.0f);
 uint idx = DTid.x + DTid.y * c_width; // read from structured buffer
 sharedMem[GI] = dot(Input[idx], LumVector); // store in shared memory   
 GroupMemoryBarrierWithGroupSync(); // wait until everything is transfered from device memory to shared memory
 [unroll(groupthreads / 2)]
 for (uint s = groupthreads / 2; s > 0; s >>= 1)
  if (GI < s)
      sharedMem[GI] += sharedMem[GI + s];
 // Have the first thread write out to the output
 if (GI == 0)
  // write out the result for each thread group
  Result[Gid.xy] = sharedMem[0] / (THREADX * THREADY);


The changes are marked in red. While on previous hardware generations, this slight change in source code had some impact on the performance, it looks like on modern AMD GPUs it doesn't seem to make a difference anymore. All the measurements were done on an AMD RADEON(TM) HD 6770, an AMD RADEON(TM) HD 7750 and an AMD RADEON(TM) HD 7850:

Image 5 - Peformance of Interleaved / Sequential TGSM access pattern

In case of our example program re-arranging the access pattern doesn't make a difference. It might be that the driver re-arranges the code already or the hardware re-directs the accesses.

Unroll the Loops
A likely overhead of the shaders shown above is the instruction overhead of ancillary instructions that are not loads, stores, or arithmetic instructions for core computations. In other words address arithmetic and loop instructions overhead.
Thread groups that access Thread Group Shared Memory are automatically broken down into hardware schedulable groups of threads. In case of NVIDIA, those are called Warps and there are 32 threads in a warp, and in case of AMD they are called a Wavefront and there are 64 threads in a wavefront (there is a finer level of granularity regarding Wavefronts that we won't cover here). Instructions are SIMD synchronous within a Warp or Wavefront. That means as long as the number of threads executed are below 32 for NVIDIA or below 64 on AMD, a memory barrier is not necessary.
In case of the tree like algorithm that is used for Parallel Reduction as shown in Image 2, the number of threads utilized in a loop are decreasing. As soon as they are below 32 or 64, a memory barrier shouldn't be necessary anymore.
This means that unrolling loops not only might save some ancillary instructions but also might reduce the number of memory barriers used in a compute shader. Source code for an unrolled loop might look like this:

… // like the previous shader
if (groupthreads >= 256)
     if (GI < 128)
           sharedMem[GI] += sharedMem[GI + 128];
// AMD - 64 / NVIDIA - 32
if (GI < 64)

 if (groupthreads >= 64) sharedMem[GI] += sharedMem[GI + 32];
 if (groupthreads >= 32) sharedMem[GI] += sharedMem[GI + 16];
 if (groupthreads >= 16)sharedMem[GI] += sharedMem[GI + 8];
 if (groupthreads >= 8)sharedMem[GI] += sharedMem[GI + 4];
 if (groupthreads >= 4)sharedMem[GI] += sharedMem[GI + 2];
  if (groupthreads >= 2)sharedMem[GI] += sharedMem[GI + 1];

The performance numbers for this optimization show that older hardware appreciates the effort of unrolling the loop and decreasing the number of memory barriers more than newer designs:

Image  6 - Unrolled Loops / Less Memory Barriers performance Impact

Pre-fetching two Color Values into Shared Memory
When looking at the previous shader, the only operation that utilizes all 256 threads in the thread group is the first load into shared memory. By fetching two color values from device memory and adding them already at the beginning of the shader, we could utilize the threads better.
To stay consistent with the previous Parallel Reduction shaders and offering the same 1080p to 120x68 reduction, the following shader only uses 64 threads in the thread group.

// pack two values
// like the previous shader
// store in shared memory
float temp = (dot(Input[idx * 2], LumVector) + dot(Input[idx * 2 + 1], LumVector));
sharedMem[GI] = temp;

// AMD - 64 / NVIDIA - 32
if (GI < 32)
    if (groupthreads >= 32) sharedMem[GI] += sharedMem[GI + 16];
    if (groupthreads >= 16)sharedMem[GI] += sharedMem[GI + 8];
    if (groupthreads >= 8)sharedMem[GI] += sharedMem[GI + 4];
    if (groupthreads >= 4)sharedMem[GI] += sharedMem[GI + 2];
   if (groupthreads >= 2)sharedMem[GI] += sharedMem[GI + 1];

Because the number of threads used is 64, memory barriers are not necessary.

Image  7 - Pre-fetching two color values

It seems that throughout the hardware generations, the performance benefits from fetching two values at the same time, although the number of threads per thread group were reduced to 64 from 256 are appreciated. The reduced number of threads will become a topic later on.

Pre-fetching four Color Values into Shared Memory
With the success story behind fetching two color values into TGSM, the obvious question arises, what would happen if four values would be fetched. To keep the Parallel Reduction algorithm comparable, so that it reduces from 1080p to 120x68, the threads in the thread group are reduced again.
The following shader only uses 16 threads per thread group and is therefore considered not efficient in this respect. The official rule of thumb is using a multiply of 64. On the bright side it doesn't use any memory barriers.

// pack four values
#define THREADX 4
#define THREADY 4
// like the previous shader
float temp = (dot(Input[idx * 4], LumVector) + dot(Input[idx * 4 + 1], LumVector))
            + (dot(Input[idx * 4 + 2], LumVector) + dot(Input[idx * 4 + 3], LumVector));
// store in shared memory -> no group barrier
sharedMem[IndexOfThreadInGroup] = temp;
// AMD - 64 / NVIDIA - 32
if (IndexOfThreadInGroup < 16)
  if (groupthreads >= 16)sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 8];
  if (groupthreads >= 8)sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 4];
  if (groupthreads >= 4)sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 2];
  if (groupthreads >= 2)sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 1];

The performance increase compared to the previous shader shows a nearly linear increase:

Image  8 - Pre-fetching four color values

Looking at the improvements of fetching four instead of two color values brings up the question, how would performance change if the number of threads in a thread group would be increased and the number of thread groups in the dispatch then decreased, which also leads to a higher parallel reduction because the resulting image is smaller.
The next example increases the number of threads from 16 to 64:

// pack four values
#define THREADX 8
#define THREADY 8
// like the previous shader
float temp = (dot(Input[idx * 4], LumVector) + dot(Input[idx * 4 + 1], LumVector))
            + (dot(Input[idx * 4 + 2], LumVector) + dot(Input[idx * 4 + 3], LumVector));

// store in shared memory > no group barrier
sharedMem[IndexOfThreadInGroup] = temp;

// AMD - 64 / NVIDIA - 32
if (IndexOfThreadInGroup < 64)
  if (groupthreads >= 64)sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 32];
  if (groupthreads >= 32)sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 16];       if (groupthreads >= 16)sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 8];
  if (groupthreads >= 8)sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 4];
  if (groupthreads >= 4)sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 2];
  if (groupthreads >= 2)sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 1];

Similar to the previous shader this shader avoids any memory barriers but it runs with 64 instead of 16 threads and it is not executed as often, because the grid size was reduced to 60 x 34.

Image  9 -  Increasing the number of Threads from 16 to 64 and decreasing the size of the result

Although the number of threads are increased, the workload of this shader is also increased due to halving the size of the resulting image in each direction. In other words this shader does more work than the previous shaders. This allows the conclusion that this shader runs faster than the previous one.

Following the successful path of increasing the number of threads, the last shader in this blog post will use 256 threads to parallel reduce the image size from 1080p to 30x17.

... // like the previous shaders
// store in shared memory 
sharedMem[IndexOfThreadInGroup] = temp;

// wait until everything is transfered from device memory to shared memory

if (groupthreads >= 256)
if (IndexOfThreadInGroup < 128)
sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 128];

// AMD - 64 / NVIDIA - 32
if (IndexOfThreadInGroup < 64)
if (groupthreads >= 64)sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 32];
if (groupthreads >= 32)sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 16];
if (groupthreads >= 16)sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 8];
if (groupthreads >= 8)sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 4];
if (groupthreads >= 4)sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 2];
if (groupthreads >= 2)sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 1];

With the increased number of threads we have to add memory barriers again. Nevertheless this shader runs quicker than all the previous shaders while -at the same time- doing more work:

Image  10 -  Increasing the number of Threads from 64 to 256 and decreasing the size of the result

Please note how the older GPU starts beating the newer GPU when the number of threads are increased. Overall for the 6770, we went from roughly 1 ms to close to a tenth of the original time frame. For the 7750 and the 7850 we ended up reducing the frame time to roughly a bit more than a fourth, while increasing the workload in the last two test setups. 

Like with most optimization tasks there is always more to consider and more to try out. A list of things that would be worth considering is still short but give some time, it will increase. 
If you -the valued reader of this blog- have anything you want me to try and add to this list, please let me know and I will add it to this blog post.
Overall I believe the case studies shown above should give someone a good starting point to optimize the Parallel Reduction part of a post-processing pipeline.

One other topic crucial for the performance of a post-processing pipeline is the speed of the blur kernel. Optimizations that lead to the "ultimate" blur kernel will have to wait for a future blog post :-)

Thanks to Stephan Hodes from AMD for providing feedback.

Monday, March 24, 2014

DirectX 12 Blog

Finally information about DirectX 12 is published on Matt Sandy's blog.

Today I wear my DirectX 12 T-Shirt to work ... below this shirt I am wearing the Mantle T-Shirt (... I was thinking about the order for a while but only this order can make sense ... right?).
I had the opportunity to test drive DirectX 12 in the last couple of months and it looks already great. Very excited to work with DirectX 12 and Mantle in the near future.

Thursday, March 13, 2014

GDC 2014 - Compute Shader Optimizations

I will be speaking at the Sony booth on Wednesday at 5pm on compute shader optimizations. The 15 minute talk will be broadcast on Twitch.
The talk will cover performance numbers of three different AMD GPUs: RADEON 6770, RADEON 7750 and RADEON 7850.
The main topics are:

  • Sequential Shared Memory (TGSM) Access: utilizing the Memory bank layout 
  • When to Unroll Loops in a compute shader 
    • Overhead of address arithmetic and loop instructions 
    • Skipping Memory Barriers: Wavefront 
  • Pre-fetching data into Shared Memory 
  • Packing data into Shared Memory
Looking at two different generations of AMD GPUs makes it better visible which one of the ground rules developed for GPU optimizations works on current GPUs, compared to previous generations.
This is based on some of the optimization work we did on AAA games last year. 
At Confetti we have Aura - our Dynamic Global Illumination System- and PixelPuzzle - our PostFX pipeline - running in compute.
This talk will deal with how to optimize parts of a PostFX pipeline with Compute. I am also planning to write a blog series about this.

Friday, January 10, 2014

Visual Studio 2013 - C99 support

I think using C99 in game development could be useful for large teams, especially if they are distributed over several locations.
So I thought I look a little bit closer on the support of C99 in Visual Studio 2013 (we also use VS 2013 with C99 now in my UCSD class).
The new features that are support in VS 2013 are:

New features in 2013
- variable decls
- _Bool
- compound literals
- designated initializers

Already available:
variadic macros, long long, __pragma, __FUNCTION__, and __restrict

What is missing:
- variable-length arrays (VLAs)
- Reserved keywords in C99
C99 has a few reserved keywords that are not recognized by C++:
_Bool -> this is now implemented ... see above
- restrict keyword
C99 supports the restrict keyword, which allows for certain optimizations involving pointers. For example:

    void copy(int *restrict d, const int *restrict s, int n)
        while (n-- > 0)
            *d++ = *s++;
C++ does not recognize this keyword.
A simple work-around for code that is meant to be compiled as either C or C++ is to use a macro for the restrict keyword:

    #ifdef __cplusplus
     #define restrict    /* nothing */
(This feature is likely to be provided as an extension by many C++ compilers. If it is, it is also likely to be allowed as a reference modifier as well as a pointer modifier.)

Don't know if it is in there:
- hexadecimal floating-point literals like 
float  pi = 0x3.243F6A88p+03; 
- C99 adds a few header files that are not included as part of the standard C++ library, though:


Thursday, January 2, 2014

CSE 190 - GPU Programming UCSD class Winter 2014

GPU Programming
With the new console generation and the advances in PC hardware, compute support is becoming more important in games. The new course in 2014 will therefore start with compute and we will spend about a 1/3 of the whole course talking about how it is used on next-gen consoles and in next-gen games. We will also look into several case studies and discuss the feasibility to "re-factor" existing game algorithms so that they run in compute. An emphasis is put here on effects that are traditionally used for post-processing effects.

The remaining 2 / 3 of the course will focus on the DirectX 11.2 graphics API and how it is used in games to create a rendering engine for a next-gen game. We will cover most of the fundamental concepts like the HLSL language, renderer design, lighting in games, how to generate shadows and we also discuss how transparency can be mimicked with techniques other than alpha blending.
The course will end with a survey of different real-time Global Illumination algorithms that are used in different types of games.

First Class
-- DirectX 11.2 Graphics
-- DirectX 11.2 Compute
-- Tools of the Trade - how to setup your development system
Introduction to DirectX 11.2 Compute
-- Advantages
-- Memory Model
-- Threading Model
-- DirectX 10.x support

Second Class
Simple Compute Case Studies
- PostFX Color Filters
- PostFX Parallel Reduction
- DirectX 11 Mandelbrot
- DirectX 10 Mandelbrot

Third Class
DirectCompute performance optimization
- Histogram optimization case study

Fourth Class
Direct3D 11.2 Graphics Pipeline Part 1
- Direct3D 9 vs. Direct3D 11
- Direct3D 11 vs. Direct3D 11.1
- Direct3D 11.1 vs. Direct3D 11.2
- Resources (typeless memory arrays)
- Resource Views
- Resources Access Intention
- State Objects
- Pipeline Stages
-- Input Assembler
-- Vertex Shader
-- Tesselation
-- Geometry Shader
-- Stream Out
-- Setup / Rasterizer
-- Pixel Shader
-- Output Merger
-- Video en- / decoder access

Fifth Class
Direct3D 11.2 Graphics Pipeline Part 2
--- Keywords
--- Basic Data Types
--- Vector Data Types
--- Swizzling
--- Write Masks
--- Matrices
--- Type Casting
--- SamplerState
--- Texture Objects
--- Intrinsics
--- Flow Control
-- Case Study: implementing Blinn-Phong lighting with DirectX 11.2
--- Physcially / Observational Lighting Models
--- Local / Global Lighting
--- Lighting Implementation
---- Ambient
---- Diffuse
---- Specular
---- Normal Mapping
---- Self-Shadowing
---- Point Light
---- Spot Light

Sixth Class
Physically Based Lighting
- Normalized Blinn-Phong Lighting Model
- Cook-Torrance Reflectance Model

Seventh Class
Deferred Lighting, AA
- Rendering Many Lights History
- Light Pre-Pass (LPP)
- LPP Implementation
- Efficient Light rendering on DX 9, 10, 11
- Balance Quality / Performance
- MSAA Implementation on DX 10.0, 10.1, XBOX 360, 11
Screen-Space Materials
- Skin

Eigth Class
- The Shadow Map Basics
- “Attaching” a Shadow Map frustum around a view frustum
- Multi-Frustum Shadow Maps
- Cascaded Shadow Maps (CSM) : Splitting up the View
- CSM Challenges
- Cube Shadow Maps
- Softening the Penumbra
- Soft Shadow Maps

Nineth Class
Order-Independent Transparency
- Depth Peeling
- Reverse Depth Peeling
- Per-Pixel Linked Lists

Tenth Class
Global Illumination Algorithms in Games
- Requirement for Real-Time GI
- Ambient Cubes
- Diffuse Cube Mapping
- Screen-Space Ambient Occlusion
- Screen-Space Global Illumination
- Reflective Shadow Maps
- Splatting Indirect Illumination (SII)

Each student should bring a DirectX 11.0 or higher capable notebook with Windows 7 or 8 into class. All the examples accompanying the class are build in C/C++ in Visual Studio 2013.  

Thursday, November 7, 2013

Visual Studio 2013 / Demo Skeleton Programming

I updated my demo skeleton in the google code repository. It is using now Visual Studio 2013, that now partially supports C99 and therefore can compile the code. I updated the compute shader code a bit and I upgraded Crinkler to version 1.4. The compute shader example now also compiles the shader into a header file and then Crinkler compresses this file as part of the data compression. It packs now overall to 2,955 bytes.

If you have fun with this code, let me know ... :-)

Monday, September 30, 2013

Call for a new Post-Processing Pipeline - KGC 2013 talk

This is the text version of my talk at KGC 2013.
The main motivation for the talk was the idea of looking for fundamental changes that can bring a modern Post-Processing Pipeline to the next level.
Let's look first into the short history of Post-Processing Pipelines, where we are in the moment and where we might be going in the near future.

Probably one of the first Post-Processing Pipelines appeared in the DirectX SDK around 2004. It was a first attempt to implement HDR rendering. I believe from there on we called a collection of image space effects at the end of the rendering pipeline Post-Processing pipeline. 
The idea was to re-use resources like render targets and data with as many image space effects as possible in a Post-Processing Pipeline. 
A typical collection of screen-space effects were
  • Tone-mapping + HDR rendering: the tone-mapper can be considered a dynamic contrast operator 
  • Camera effects like Depth of Field with shaped Bokeh, Motion Blur, lens flare etc..
  • Full-screen color filters like contrast, saturation, color additions and multiplications etc..
One of the first coverages of a whole collection of effects in a Post-Processing Pipeline running on XBOX 360 / PS3 was done in [Engel2007].
Since then numerous new tone mapping operators were introduced [Day2012], new more advanced Depth of Field algorithms with shaped Bokeh were covered but there was no fundamental change to the concept of the pipeline.

Call for a new Post-Processing Pipeline
Let's start with the color space: RGB is not a good color space for a post-processing pipeline. It is well known that luminance variety is more important than color variety, so it makes sense to pick a color space that has luminance in one of the channels. With the 11:11:10 render targets it would be cool to store luminance in one of the 11 bit channels. Having luminance available in the pipeline without having to go through color conversions opens up many new possibilities, from which we will cover a few below.

Global tone mapping operators didn't work out well in practice. We looked at numerous engines in the last four years and a common decision by artists was to limit the luminance values by clamping them. The reasons for this were partially in the fact that the textures didn't provide enough quality to survive a "light adaptation" without blowing out or sometimes most of their resolution was in the low-end greyscale values and there wasn't just enough resolution to mimic light adaptations. 
Another reason for this limitation was that the available resolution in the rendering pipeline with the RGB color space was not enough. Another reason for this limitation is the fact that we limited ourselves to Global tone mapping operators, because local tone mapping operators are considered too expensive.

A fixed global gamma adjustment at the end of the pipeline is partially doing "the same thing" as the tone mapping operator. It applies a contrast and might counteract the activities that the tone-mapper already does. 
So the combination of a tone-mapping operator and then the commonly used hardware gamma correction, which are both global is odd.

On a lighter note, a new Post-Processing Pipeline can add more stages. In the last couple of years, screen-space ambient occlusion, screen-space skin and screen-space reflections for dynamic objects became popular. Adding those to the Post-Processing Pipeline by trying to re-use existing resources need to be considered in the architecture of the pipeline.

Last, one of the best targets for the new compute capabilities of GPUs is the Post-Processing Pipeline. Saving memory bandwidth by merging "render target blits" and re-factoring blur kernels for thread group shared memory or GSM are considerations not further covered in the following text; but most obvious design decisions.

Let's start by looking at the an old Post-Processing Pipeline design. This is an overview I used in 2007:

A Post-Processing Pipeline Overview from 2007

A few notes on this pipeline. The tone mapping operation happens at two places. At the "final" stage for tone-mapping the final result and in the bright-pass filter for tone mapping the values before they can be considered "bright". 
The "right" way to apply tone mapping independent of the tone mapping operator you choose is to convert into a color space that exposes luminance, apply the tone mapper to luminance and then convert back to RGB. In other words: you had to convert between RGB and a different color space back and forth twice. 
In some pipelines, it was decided that this is a bit much and the tone mapper was applied to the RGB value directly. Tone mapping a RGB value with a luminance contrast operator led to "interesting" results.
Obviously this overview doesn't cover the latest Depth of Field effects with shaped Bokeh and separated near and far field Center of Confusion calculations, nevertheless it shows already a large amount of render-target to render-target blits that can be merged with compute support.

All modern rendering pipelines calculate color values in linear space; meaning every texture that is loaded is converted into linear space by the hardware, then all the color operations are applied like lighting and shadowing, post-processing and then at the end the color values are converted back by applying the gamma curve.
This separate Gamma Control is located at the end of the pipeline, situated after tone mapping and color filters. This is because the GPU hardware can apply a global gamma correction to the image after everything is rendered. 

The following paragraphs will cover some of the ideas we had to improve a Post-Processing Pipeline on a fundamental level. We implemented them into our Post-Processing Pipeline PixelPuzzle. Some of the research activities like finally replacing the "global tone mapping concept" with a better way of calculating contrast and color will have to wait for a future column.

Yxy Color Space
The first step to change a Post-Processing Pipeline in a fundamental way is to switch it to a different color space. Instead of running it in RGB we decided to use CIE Yxy through the whole pipeline. That means we convert RGB into Yxy at the beginning of the pipeline and convert back to RGB at the end. In-between all operations run on Yxy.
With CIE Yxy, the Y channel holds the luminance value. With a 11:11:10 render target, the Y channel will have 11 bits of resolution.

Instead of converting RGB to Yxy and back each time for the final tone mapping and the bright-pass stage, running the whole pipeline in Yxy means that this conversion might be only done once to Yxy and once or twice back to RGB.
Tone mapping then still happens with the Y channel in the same way it happened before. Confetti's PostFX pipeline offers eight different tone mapping operators and each of them works well in this setup.
Now one side effect of using Yxy is also that you can run the bright-pass filter as a one channel operation, which saves on modern scalar GPUs some cycles.

One other thing that Yxy allows to do is to consider the occlusion term in Screen-Space Ambient Occlusion as a member of the Y channel. So you can mix in this term and use it in interesting ways. Similar ideas apply to any other occlusion term that your pipeline might be able to use.
The choice of using CIE Yxy as the color space of choice was arbitrary. In 2007 I evaluated several different color spaces and we ended up with Yxy at the time. Here is my old table:

Pick a Color Space Table from 2007

Compared to CIE Yxy, HSV doesn't allow easily to run a blur filter kernel. The target was to leave the pipeline as unchanged as possible when picking a color space. So with Yxy, all the common Depth of Field algorithms and any other blur kernel runs unchanged in Yxy. HSV conversions also seem to be more expensive compared to RGB -> CIE XYZ -> CIE Yxy and vice versa.
There might be other color spaces similar tailored to the task.

Dynamic Local Gamma
As mentioned above, the fact that we apply a tone mapping operator and then later on a global gamma operator appears to be a bit odd. Here is what the hardware is supposed to do when it applies the gamma "correction".

Gamma Correction
The main take-away from this curve is that the same curve is applied to every pixel on screen. In other words: this curve shows an emphasis on dark areas independently of the pixel being very bright or very dark.
Whatever curve the tone-mapper will apply, the gamma correction might be counteracting it.

It appears to be a better idea to move the gamma correction closer to the tone mapper, making it part of the tone mapper and at the same time apply gamma locally per pixel.
In fact gamma correction is considered depending on the light adaptation level of the human visual system. The "gamma correction" that is applied by the eye changes the perceived luminance based on the eye's adapatation level [Bartleson 1967] [Kwon 2011].
When the eye is adapted to dark lighting conditions, the exponent for the gamma correction is supposed to increase. If the eye is adapted to bright lighting conditions, the exponent for the gamma correction is supposed to decrease. This is shown in the following image taken from [Bartleson 1967]:
Changes in Relative Brightness Contrast [Bartleson 1967]

A local gamma value can vary with the eye's adaptation level. The equation that adjusts the gamma correction following the current adaptation level of the eye can be found in [Kwon 2011].
γv=0.444+0.045 ln(Lan+0.6034)
For this presentation, this equation was taken from the paper by Kwon et all. Depending on the type of game there is an opportunity to build your own local gamma operator. 
The input luminance value is generated by the tone mapping operator and then stored in the Y channel of the Yxy color space:
γv changes based on the luminance value of the current pixel. That means each pixels luminance value might be gamma corrected with a different exponent. For the equation above, the exponent value is in the range of 0.421 to 0.465. 
Applied Gamma Curve per-pixel based on luminance of pixel
Eye’s adaptation == low - >blue curve
Eye’s adaptation value == high -> green curve
works with any tone mapping operator. L is the luminance value coming from the tone mapping operator. 
With a dynamic local gamma value, the dynamic lighting and shadowing information that is introduced in the pipeline will be considered for the gamma correction. The changes when going from bright areas to dark areas appear more natural. Textures are holding up better the challenges of light adaptation. Overall lights and shadows look better.

Depth of Field
As a proof-of-concept of the usage of Yxy color space and the local dynamic gamma correction, this section is showing screen-shots of a modern Depth of Field implementation with separated near and far field calculations and a shaped Bokeh, implemented in compute.

Producing an image through a lens leads to a "spot" that will vary in size depending on the position of the original point in the scene:
Circle of Confusion (image taken from Wikipedia) 

The Depth of Field is the region, where the CoC is less than the resolution of the human eye (or in our case the resolution of our display medium). The equation on how to calculate the CoC [Potmesil1981] is:

Following the variables in this equation, Confetti demonstrated in a demo at GDC 2011 [Alling2011] the following controls:
  • F-stop - ratio of focal length to aperture size
  • Focal length – distance from lens to image in focus
  • Focus distance – distance to plane in focus
Because the CoC is negative for far field and positive for near field calculations, separate results are commonly generated for the near field and far field of the effect [Sousa13].
Usually the calculation of the CoC is done for each pixel in a down-sampled buffer or texture. Then the near and far field results are generated. Then, first, the far and focus field results are combined and then this result is combined with the near field, based on a near field coverage value. The following screenshots show the result of those steps, with the first screenshot showing the near and far field calculations:

Red = max CoC(near field CoC)
Green = min CoC(far field CoC)

Here is a screenshot of the far field result in Yxy:

Far field result in Yxy

Here is a screenshot of the near field result in Yxy:
Near field result in Yxy

Here is a screenshot of resulting image after it was converted back to RGB:
Resulting Image in RGB

A modern Post-Processing Pipeline can benefit greatly from being run in a color space that offers a separable luminance channel. This opens up new opportunities for an efficient implementation of many new effects.
With the long-term goal of removing any global tone mapping from the pipeline, a dynamic local gamma control can offer more intelligent gamma control that is per-pixel and offers a stronger contrast of bright and dark areas, considering all the dynamic additions in the pipeline.
Any future development in the area of Post-Processing Pipelines can be focused on a more intelligent luminance and color harmonization.

[Alling2011] Michael Alling, "Post-Processing Pipeline",
[Bartleson 1967] C. J. Bartleson and E. J. Breneman, “Brightness function: Effects of adaptation,” J. Opt. Soc. Am., vol. 57, pp. 953-957, 1967.
[Day2012] Mike Day, “An efficient and user-friendly tone mapping operator”,
[Engel2007] Wolfgang Engel, “Post-Processing Pipeline”, GDC 2007
[Kwon 2011] Hyuk-Ju Kwon, Sung-Hak Lee, Seok-Min Chae, Kyu-Ik Sohng, “Tone Mapping Algorithm for Luminance Separated HDR Rendering Based on Visual Brightness Function”, online at
[Potmesil1981] Potmesil M., Chakravarty I. “Synthetic Image Generation with a Lens and Aperture Camera Model”, 1981
[Reinhard] Erik Reinhard, Michael Stark, Peter Shirley, James Ferwerda, "Photographic Tone Reproduction for Digital Images",
 [Sousa13] Tiago Sousa, "CryEngine 3 Graphics Gems", SIGGRAPH 2013,