Diary of a Graphics Programmer

Catching Up / History of The Forge / GPU Zen / Ray Tracing / Holiday Dinner

2020-11-20T08:48:00.002-08:00

I just realized I haven't posted here since 2018. There is generally less sharing of information happening now compared to let's say 10 years ago and I seem to have become one of the people who shares less. In my defense, I can bring up good reasons :-) :

Being part of an ever-growing company with increasing business and HR needs makes it harder for me to focus on the technical aspects of our work in a blog post. Dealing with the development of The Forge, H1-B or O1 Visas, 401k plans, lunch reimbursements, deciding what health insurances to offer, writing contracts and invoices and keeping track of all the money movements, deciding where the holiday dinner should happen, and spending time with interviewing all the new hires, the landlord of our office building, bookkeepers, tax advisors, IP lawyers, and other lawyers leaves much less time for this. Before I can write the blog post I need to decide how we will have a Holiday dinner this year.

With COVID-19 I also began training in two more Martial Arts after practicing Tang Soo Do for more than 11 years: Gum Do and Tai Chi. When working from home it helps with the stress levels to just step outside and listen to the whooshing sound of a Sword splitting the air. Tai Chi is something I do together with my wife, which makes this activity even more valuable.

So now that my defense was laid out, let's get back to business.

One of the interesting things we did was open-source our internal rendering framework. The company has been using an internal framework since day 1 of its existence.

This rendering framework is kind of like the "beehive mind" of the company. Over the years some of the best people in the company were invited to extend this framework and then the whole company benefitted from its existence as a blueprint for typical rendering tasks.

In 2017 we decided to write a new rendering framework from scratch because we wanted to cover the new Graphics APIs better and needed fundamental changes in the architecture of our old framework. This framework was open-sourced at the beginning of 2018 and has steadily improved since then. We named it "The Forge" because with its help we can create new tools, game engines, experiences. It is shipping in custom AAA game engines and in smaller things like editors or educational apps, and in the future it will run on hundreds of millions of devices as the foundation of business frameworks. We also used it to write a new game engine for Supergiant, which has shipped Hades so far on PC, macOS, and Switch.

https://github.com/ConfettiFX/The-Forge

With every release of The Forge, I write release notes in the style of blog posts. I offer an opinion on why we implemented things the way they were or often describe what went wrong and how we had to rewrite the same sub-system 3 or more times. I describe our technical successes and our failures :-)

I am hoping the release notes are partially making up for the lack of blog posts here. After all, one can look at the source code in its entirety and see what I am talking about. Something my blog posts didn't always offer in the past. Generally, the lack of source code makes a lot of presentations or descriptions of technical implementations less valuable.

Regarding GPU Zen: helping aspiring and experienced graphics programmers was always a goal of mine. I consider The Forge now more useful than a new edition of GPU Zen. It provides the actual code and you can see it working in the games it shipped with or will be shipping with. Compared to a conference talk or a book chapter, this is really what everyone would want to see. I believe a presentation that outlines a technique accompanied by a math equation doesn't offer that same level of usefulness. Showing source code is the ultimate way to share graphics programming knowledge.
So you can think of "The Forge" in GitHub as the next-gen GPU Zen. That doesn't mean we won't do another GPU Zen in the future. It just means we have to think about the value proposition this future book will offer.

Ray Tracing: my last blog post on this topic made some waves in the industry. At some point, I couldn't really dedicate as much time to this topic as I wanted. It came up in Advisory boards, there were IP related projects we worked on regarding Ray Tracing and we -as an industry- eventually succeeded in getting a more open interface.
My company gave a talk on cross-platform Ray Tracing where the macOS / iOS ray tracing run-time was extended to be on par with the DXR / RTX run-time (available in the GitHub repository). This was mostly meant for tool development but can also be used in a cross-platform game engine. I think it also shows a blueprint of how to do Ray Tracing with the newer interfaces on various platforms.

Holiday dinner: before I wrote this blog post, I organized the first Holiday dinner via Skype. My colleagues in the PST time zone will dial in via Skype. They will get food via DoorDash and the company will reimburse it. This will be a good time to see how everyone's family has grown, what the dogs are doing, etc. :-) I will then repeat a dinner/breakfast with the colleagues in the different time zones we cover.

Ray Tracing without Ray Tracing API

2018-09-07T12:02:00.000-07:00

Following up on my last blog post, where I stated that a Ray Tracing API is bad for game developers and publishers because of the increase in QA effort that it will bring: based on the last 20+ years of graphics development, it is easy to project that when a large part of the ray tracing codebase is owned by a hardware vendor, there will be various bugs introduced with each driver release. For game developers and game engine middleware providers, it will make it expensive to support such an API.

The current main use cases for real-time Ray Tracing in games are Shadows, Reflections, and Ambient Occlusion.
To find out how easy it would be to avoid using the Ray Tracing APIs, we wanted to see how fast "native" implementations would be compared to the ones that are using Ray Tracing APIs.

At the time, Kostas Anagnostou @KostasAAA had experimented with getting hybrid Ray Tracing running on lower-end GPUs. I was talking to him because he was supposed to write an article for GPU Zen 2 about a culling system. I asked him if he would be interested in integrating his experiments in the Confetti rendering framework The Forge, so that we can run them on more platforms like Linux (VLKN), macOS / iOS (Metal 2) and then the XBOX One. When he started working on this, he re-visited some of his former ideas and improved the performance substantially. He wrote a blog post here: Interplay of Light

Today a new release of The Forge came out that supported his approach to Hybrid Ray Traced Shadows. It is running on all platforms that The Forge supports and the iOS version is running astonishingly well.

We have reason to believe that when it comes to hybrid Ray Traced Shadows, at the moment it is better to not use a Ray Tracing API. When using DXR, we expect Hybrid Ray Traced Shadows to run with subpar performance on all hardware apart from the GeForce RTX. We expect our implementation, when compared to a DXR version with both running on a GeForce RTX, to perform admirably and, most importantly, to be practically useful in games.

Here is a screenshot running on an iPhone 7 with the Sponza scene and iOS 11.4.1 (15G77). Resolution is 1334 x 750.

This approach was developed on a PC running Windows DX12 / VLKN; we just ported the PC version to macOS / iOS and the other target platforms. If we spent some time on iOS-specific optimizations, there might be opportunities to improve the framerate.

The next step is to develop a Hybrid Ray Traced Reflection approach.

The general hope is to make it possible to have Hybrid Ray Traced Shadows, Reflections, and Ambient Occlusion on all GPUs and on all platforms available.

Let's see how far we can get ...

Ray Tracing with the DirectX Ray Tracing API (DXR)

2018-04-05T08:19:00.000-07:00

Experimental DXR support was added to The Forge today. Let's think about this for a second by starting with a few quotes from a now famous book, that many people have read in the last couple of days/weeks. The quotes are on the first page of the book "Ray Tracing in One Weekend" by Peter Shirley:

I've taught many graphics classes over the years. Often I do them in ray tracing, because you are forced to write all the code but you can still get cool images with no API.

Later on the same page, he says:

When somebody says "ray tracing" it could mean many things.

If you read the following pages of this book, you can see that writing a ray tracer might have a low level of complexity. This is why computer graphics students learn ray tracing before they are exposed to the world of Graphics APIs.
So the first question that goes through one's head when looking at DXR is, why do we now need an API for Ray Tracing? The obvious follow-up is, why is that beneficial for game developers?

I could just finish this blog post now, say "there is no benefit for game developers because it restricts the creative process of making games distinguishable, and that is obvious, and it will raise the cost of game development because there is one more API to take care of", and move on with my day :-) I could say something along the lines of "requesting an API for ray tracing is like selling refrigerators in Antarctica" because in this sentence both comparisons have about the same level of complexity.
I could also say something along the lines of "you want to ruin the whole computer graphics experience by letting them learn a ray tracing API first? Do you not have any heart?".

Instead, let's just ask the question why do we need a ray tracing API?

Hardware vendors will say: because we have to map the complexity of ray tracing to our hardware; we need an API. So far we haven't dedicated any silicon to ray tracing (except PowerVR) because rasterized graphics are commercially at this moment more successful. Just in case this time (compared to the 10+ times before) it catches on, we are promising to add ray tracing silicon as long as you let us hide all this for a while behind an API.

OS vendors will say: we want to be at the forefront of development and we want to bind ray tracing to our platform, let's define an API that everyone has to use to bind them also to our platform. If they want to develop for a different platform they will have to rewrite all their code or use a different API that is hopefully less powerful than ours and those two facts might keep them from doing it ...

Then there are game development studios and then graphics programmers. Commonly described as wise and matured, drawing pictures of cats or flowers on paper and screens and living out their creative energies by writing ray tracing code at ShaderToy. ShaderToy is a showcase of the possibilities of ray tracing and all the opportunities it might have in games. It shows the wide range of approaches with their respective pros and cons and what creative people can really achieve if you offer them the freedom to do it.

Making games distinguishable on a visual level is a common requirement, similar to movies. There is not one ray tracing technique that can take over the role of being generic. There is no way any API could settle on a subset of ray tracing that could be acceptable for a large group of game developers. We *might* all be able to agree on a BVH structure but that is where it ends. Why would any game developer want to black-box ray tracing? This is comparable to saying the following: "a game developer only needs one pixel shader for a material like metal, one pixel shader for a material like skin and all the graphics card drivers only need to be optimized for exactly those pixel shaders. That would make life so much easier.". I put this sentence in quotes because this actually happened not long ago and the rationalization was publicly expressed. Game developers were able to prevent this from happening.
As a game developer and graphics programmer, my interest is always in making commercially successful products that are appealing to gamers and distinguishable from their competitors in appearance and gameplay. Deciding about the fate of ray tracing by creating an API that black boxes parts of it is counterproductive to this effort and not in the interest of small or big game developers like EA, Ubisoft, R* and others.

Then there is the cost factor. Supporting another graphics API will be expensive. As usual, the expense is not in the initial implementation but in the maintenance and QA. So in case someone licenses middleware, the cost of maintenance is still there. Treating ray tracing as an additional feature set to the common graphics APIs should be cheaper.

DXR is in the proposal stage at the moment. Microsoft expressed interest in getting feedback from developers and they would like to change it. I would like to encourage people and game development companies to raise their concerns.
On a technical level, I would prefer to extend the existing APIs "enough" and offer more flexibility through additional features like for example a ray tracing feature level, instead of adding a black box for ray tracing. As soon as the special hardware is available it can be exposed through extensions as well.

This will give game developers the creative freedom they need and at the same time make it easier for them to invest in it.

Addendum 1:
One more thought added: at the moment, if you want to ship a game on the PC, there is a high chance you will have to go to a hardware vendor to ask for driver updates or performance improvements. If Ray Tracing (RT) gets its own API, we will have to ask for driver updates for RT as well. Obviously, hardware vendors might want something in return for fixing drivers or making sure your RT code runs fast. This is how it works on PC with Graphics APIs.

We work quite often with smaller developers and bigger developers. Bigger developers just go to the hardware vendor and ask for driver updates and, depending on how "important" their game is, they will get them. Smaller developers hardly get noticed in the process. Over the years most of us agreed that the driver / API situation is not good. Now, if we add an RT API, we are creating the same ecosystem for a second API on PC ....

On an economic level, this is very much noticeable by publishers and therefore we can explain this to Bethesda, EA, R* and others. If a game launches on a "broken" driver, the sales of that game will be lower. A game publisher can predict the amount of money that an RT API will cost them during the launch of a game. If we add an RT driver/API we have two opportunities to tank sales at the beginning of the lifetime of a game. Most of us have seen our games launch with "broken" drivers. Now extend the experience to a second API.

From that perspective an RT driver is a huge economic factor that is put on the shoulders of the game developers ... you could say whoever wants to add another driver to the PC ecosystem increases the cost of game development on PC substantially ... although it is hard to specify how much.

Triangle Visibility Buffer

2018-03-30T13:13:00.014-07:00

A Rendering Architecture for high-resolution Displays and Console Games

-----------------------------------

Document History:

- Initial Published March 30th, 2018

- Updated January 22nd, 2021

- Updated June 4th, 2021 with a simplified degenerate triangle removal

- Updated June 12th, 2021 links to the new Forge Shader Language shaders should work now

-----------------------------------

The "Triangle Visibility Buffer" has been a research project at our company since September 2015. This blog entry serves the purpose of outlining the current status.

We called it Triangle Visibility Buffer because this rendering technique is keeping track of triangles through the whole rendering pipeline and stores visibility of every opaque triangle in the scene in a buffer. The technique is very suitable for target hardware platforms that have only limited amounts of high-speed memory to store render data but need to support large resolutions. It is also most suitable for rendering on high-resolution displays with 4k, 5k, or 8k resolutions.
You can find the source code accompanying this blog entry at

https://github.com/ConfettiFX

In this repository, there are PC DirectX 12 / Vulkan, Linux Vulkan and macOS / Metal 2 implementations available. On request, there is also source code for various console platforms available. The following text will only refer to the DirectX 12 implementations for consistency. Finding the Vulkan and macOS counterparts in the code base is left to the reader.

Following the data flow in the Triangle Visibility Buffer rendering pipeline, the first stage is the triangle removal stage, which culls invisible triangles from the data set.

Multi-View Triangle Cluster Culling / Triangle Filtering

The number of polygons in games increases every year. Hardware can become bottlenecked in the Command Processor when empty draw calls are spawned, in the vertex shader by the number of vertices to transform, in backface culling and clipping, and/or in the rasterizer because small triangles that are not visible make it primitive bound.
To reduce the number of triangles that are going into the graphics pipeline in general, a triangle removal stage is added as a first step to the whole rendering system. This way the graphics pipeline can be better utilized with the visible triangles.
Triangle removal is not a new concept and was utilized on certain platforms already a decade ago or probably even longer. Due to the triangle complexity of modern games, it was revived in recent years in talks by [Chajdas][Wihlidal] because modern hardware seems to benefit from it.
The techniques used in the demo to remove triangles consist of the following consecutive stages:
- Cluster culling: cull groups of 256 triangles with a similar orientation on the CPU
- Triangle filtering: cull individual triangles in an async compute shader
- Draw call compaction: remove empty draw calls with no triangles left and order the remaining draw calls sequentially in a compute shader

The demo implementation does all the above for the main camera view and the shadow map view at the same time. We call this Multi-View Triangle removal.

Triangle Cluster Culling on the CPU

Triangle cluster culling -running in this case on the CPU- removes invisible chunks of 256 triangles with similar orientation. This is done by picking the first 256 triangles of a mesh and then testing them against a visibility test cone. In the following Image 1, the triangles and the face normals of those triangles are represented by the orange lines.

Image 1 - Triangle Cluster Culling - The Test Cone

The light blue triangle is the test cone. In case the eye or the camera is inside this test cone, the triangles are considered not visible. To find the center of the cone, we start out by accumulating the face normals of the triangle cluster negatively as shown in Image 2.

Image 2 - Triangle Cluster Culling - Negatively accumulating Triangles to find the cone center

To calculate the cone open angle, the most restrictive triangle planes are taken from the triangle cluster as shown in Image 3.

Image 3 - Triangle Cluster Culling - Calculating the Test Cone Planes

If the camera or eye is in the area of the test cone, the triangle cluster is not visible and can, therefore, be removed.
The effectiveness of this simple cluster culling mechanism depends on the scene data. If many triangles face in a similar direction, it is more efficient; if triangles face in different directions, it is less efficient.
The test scene -San Miguel- in the Visibility Buffer demo doesn’t have many clusters of triangles that are facing in a similar direction and therefore triangle cluster culling is not very efficient. This might be different with geometry that is tessellated by hardware. Because of this poor efficiency, triangle cluster culling is switched off by default in the demo.
The code for cluster culling can be found in Visibility_Buffer.cpp, and therein triangleFilteringPass() and then cullCluster().
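To make the cone test more concrete, here is a minimal CPU-side C++ sketch. It is not the code from cullCluster(): the names `ClusterCone`, `buildClusterCone`, and `isClusterBackFacing` are hypothetical, and it simplifies the problem by treating the cluster as a single point (`center`), so it ignores the spatial extent of the triangles that a production version has to account for. It assumes unit-length face normals.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

// Hypothetical per-cluster data, not the layout used in The Forge.
struct ClusterCone {
    Vec3  center;     // representative point of the cluster (assumption: its centroid)
    Vec3  axis;       // unit vector along the negatively accumulated face normals
    float sinSpread;  // sine of the widest angle between a face normal and the mean normal
    bool  valid;      // false when the normals diverge too much to ever cull the cluster
};

ClusterCone buildClusterCone(const std::vector<Vec3>& unitFaceNormals, Vec3 center) {
    assert(!unitFaceNormals.empty());
    ClusterCone cone = {center, {0.0f, 0.0f, 0.0f}, 1.0f, false};

    // Accumulate the face normals negatively to find the cone axis.
    Vec3 sum = {0.0f, 0.0f, 0.0f};
    for (Vec3 n : unitFaceNormals) sum = sub(sum, n);
    float len = std::sqrt(dot(sum, sum));
    if (len < 1e-6f) return cone;  // normals cancel out: no usable axis
    cone.axis = {sum.x / len, sum.y / len, sum.z / len};

    // The most restrictive triangle is the one whose normal deviates most
    // from the mean normal (which points along -axis).
    float minCos = 1.0f;
    for (Vec3 n : unitFaceNormals) minCos = std::min(minCos, -dot(n, cone.axis));
    if (minCos <= 0.0f) return cone;  // spread of 90 degrees or more

    cone.sinSpread = std::sqrt(std::max(0.0f, 1.0f - minCos * minCos));
    cone.valid = true;
    return cone;
}

// True when the eye lies inside the test cone, i.e. every triangle of the
// cluster faces away from the eye (under the point-cluster simplification).
bool isClusterBackFacing(const ClusterCone& cone, Vec3 eye) {
    if (!cone.valid) return false;
    Vec3 toEye = sub(eye, cone.center);
    float len = std::sqrt(dot(toEye, toEye));
    if (len < 1e-6f) return false;
    return dot(toEye, cone.axis) >= cone.sinSpread * len;
}
```

The wider the normals spread, the larger `sinSpread` becomes and the narrower the cone in which the eye must sit, which matches the observation above that clusters with diverging normals are rarely culled.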

Triangle Filtering on the GPU

To remove invisible triangles, an async compute shader is executed on each triangle. It runs 256 triangles in one batch and tests whether triangles

- are degenerate
- are back-facing
- clip through the near clipping plane of the view frustum
- clip through or are outside of the view frustum
- are too small to cover a pixel center or sampling point

Triangles that pass all tests will be appended to an index buffer. This index buffer is then called a filtered index buffer.
All the source code for this section can be found in the shader triangle_filtering.comp.fsl and in there in the function FilterTriangle().

Degenerate and Back-facing Triangles

Triangles that face away from the viewer are not visible and therefore need to be culled. This implementation uses a technique described by [Olano]. He calculates the determinant of the 3x3 clip-space matrix consisting of the x, y, and w components of the three vertices of the triangle. If the determinant is larger than 0, the triangle is back-facing in this implementation and can be culled.

If the determinant is 0, the matrix has no inverse: the triangle is degenerate, or is being viewed edge-on and has zero screen-space area. Such triangles are culled by the same comparison. Degenerate triangles might be introduced in hardware tessellation or in case there is a bug in the art asset pipeline.

Image 3 - Backface Culling

#if ENABLE_CULL_BACKFACE
 // Culling in homogeneous coordinates
 // Read: "Triangle Scan Conversion using 2D Homogeneous Coordinates"
 //       by Marc Olano, Trey Greer
 //       http://www.cs.unc.edu/~olano/papers/2dh-tri/2dh-tri.pdf
 float3x3 m = float3x3(vertices[0].xyw, vertices[1].xyw, vertices[2].xyw);
 if (cullBackFace)
   cull = cull || (determinant(m) >= 0);
#endif

Back-face culling can potentially remove 50% of the geometry.
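For readers who want to try the determinant test outside a shader, here is a small C++ sketch of the same comparison. The names `Float4`, `determinantXYW`, and `cullBackFacingOrDegenerate` are made up for this illustration; the sign convention (cull when the determinant is greater than or equal to zero) is taken from the shader snippet above, and which winding that corresponds to depends on the projection and coordinate conventions in use.

```cpp
#include <cassert>

struct Float4 { float x, y, z, w; };

// Determinant of the 3x3 matrix whose rows are the (x, y, w) components
// of the three clip-space vertices.
float determinantXYW(const Float4 v[3]) {
    return v[0].x * (v[1].y * v[2].w - v[1].w * v[2].y)
         - v[0].y * (v[1].x * v[2].w - v[1].w * v[2].x)
         + v[0].w * (v[1].x * v[2].y - v[1].y * v[2].x);
}

// Same comparison as the shader: a positive determinant is treated as
// back-facing, a zero determinant as degenerate or edge-on.
bool cullBackFacingOrDegenerate(const Float4 v[3]) {
    return determinantXYW(v) >= 0.0f;
}
```

With w = 1 for all three vertices the determinant reduces to twice the signed area of the 2D triangle, which is why swapping two vertices flips the result and three collinear vertices give exactly zero.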

Near Plane Clipping

Triangles that are in front of the near clipping plane of the view frustum need to be culled. To check whether the triangle is in front of the near clipping plane, the following code checks if the w component of each vertex is below zero. If this is true, it flips the w component to make sure the triangle is not projected onto two sides of the screen.

for (uint i = 0; i < 3; i++)
{
    if (vertices[i].w < 0)
    {
        ++verticesInFrontOfNearPlane;

        // Flip the w so that any triangle that straddles the plane
        // won't be projected onto two sides of the screen
        vertices[i].w *= (-1.0);
    }
    …
}

If all three vertices of the triangle are in front of the near clipping plane, the triangle gets culled:

if (verticesInFrontOfNearPlane == 3)
    return true;
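Put together, the two snippets above amount to the following self-contained C++ sketch. The function name `cullNearPlane` and the `Float4` struct are hypothetical stand-ins; the part of the shader loop that is elided above is left out here as well.

```cpp
#include <cassert>

struct Float4 { float x, y, z, w; };

// Returns true when all three vertices have a negative w, i.e. the triangle
// can be culled. Negative w values are flipped in place so that later
// screen-space tests see a consistent sign for straddling triangles.
bool cullNearPlane(Float4 vertices[3]) {
    unsigned verticesInFrontOfNearPlane = 0;
    for (unsigned i = 0; i < 3; ++i) {
        if (vertices[i].w < 0.0f) {
            ++verticesInFrontOfNearPlane;
            vertices[i].w *= -1.0f;
        }
    }
    return verticesInFrontOfNearPlane == 3;
}
```

A triangle with only one or two vertices on the negative side is kept, with its w components flipped, and moves on to the remaining tests.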

Frustum Culling

Triangles whose vertices are all on the negative side of the clip-space cube are outside the view frustum and therefore can be culled.

The following image 4 shows the camera situated close to a table in the San Miguel scene and the remaining triangles after frustum culling.

Image 4 - Triangle Frustum Culling

The demo code in the GitHub repository lets you freeze triangle frustum culling and then move the camera away to see the results.
To make the comparison against the clip-space cube more efficient, all the vertices of a triangle are transformed into the normalized 0..1 space first.

...
vertices[i].xy /= vertices[i].w * 2;
vertices[i].xy += float2(0.5, 0.5);
…

If all vertices of a triangle lie outside of this 0 .. 1 range on the same side, so that its bounding box does not overlap the range, it will be culled.

...
float minx = min(min(vertices[0].x, vertices[1].x), vertices[2].x);
float miny = min(min(vertices[0].y, vertices[1].y), vertices[2].y);
float maxx = max(max(vertices[0].x, vertices[1].x), vertices[2].x);
float maxy = max(max(vertices[0].y, vertices[1].y), vertices[2].y);

if ((maxx < 0) || (maxy < 0) || (minx > 1) || (miny > 1))
    return true;
...
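As a CPU-side illustration of the two steps above (remapping to the 0 .. 1 range, then comparing the bounding box), here is a short C++ sketch. The name `cullOutsideFrustumXY` is hypothetical, and the sketch assumes the near plane step has already run, so every w is positive.

```cpp
#include <algorithm>
#include <cassert>

struct Float4 { float x, y, z, w; };

// Returns true when the triangle's bounding box lies completely outside
// the 0..1 range in x or in y.
bool cullOutsideFrustumXY(const Float4 clip[3]) {
    float x[3], y[3];
    for (int i = 0; i < 3; ++i) {
        assert(clip[i].w > 0.0f);  // assumption: negative w was flipped earlier
        // Perspective divide and remap from -1..1 to 0..1 in one step.
        x[i] = clip[i].x / (clip[i].w * 2.0f) + 0.5f;
        y[i] = clip[i].y / (clip[i].w * 2.0f) + 0.5f;
    }
    const float minx = std::min({x[0], x[1], x[2]});
    const float miny = std::min({y[0], y[1], y[2]});
    const float maxx = std::max({x[0], x[1], x[2]});
    const float maxy = std::max({y[0], y[1], y[2]});
    return (maxx < 0.0f) || (maxy < 0.0f) || (minx > 1.0f) || (miny > 1.0f);
}
```

Note that a triangle with one vertex off-screen to the left and another off-screen to the right is kept, because its bounding box still overlaps the visible range.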

Small Primitive Culling

Triangles are considered too small if they do not overlap with a pixel center or a sample point after projection.

The following image shows very small triangles in-between sampling points:

Image 5 - Small Primitive Culling

Although the triangles in image 5 are too small to be visible, the rasterizer might still spend cycles dealing with them. In case the GPU can only deal with one primitive per cycle per tile, the cost of even one invisible triangle can become high.

This is why small triangles need to be removed before they hit the graphics pipeline.

The following code for the small triangle test generates a bounding box in screen-space around a triangle, then uses the x and y value of this bounding box to see if it overlaps a pixel center or sampling point in the case of MSAA.

// Scale based on distance from center to msaa sample point
int2 screenSpacePosition = int2(screenSpacePositionFP * (SUBPIXEL_SAMPLES * samples));
minBB = min(screenSpacePosition, minBB);
maxBB = max(screenSpacePosition, maxBB);

if (any(((minBB & SUBPIXEL_MASK) > SUBPIXEL_SAMPLE_CENTER) &&
        ((maxBB - ((minBB & ~SUBPIXEL_MASK) + SUBPIXEL_SAMPLE_CENTER)) < (SUBPIXEL_SAMPLE_SIZE))))
{
    return true;
}
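The fixed-point version above is hard to read in isolation, so here is a deliberately simpler floating-point C++ sketch of the same idea. It is a swapped-in variant, not a translation of the shader: it handles only the non-MSAA case, it assumes pixel centers sit at integer coordinates plus 0.5, and the names `Float2` and `cullSmallTriangle` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

struct Float2 { float x, y; };

// minBB / maxBB: screen-space bounding box of the triangle in pixels.
// Returns true when the box cannot contain a pixel center, so the
// triangle would not produce any pixels.
bool cullSmallTriangle(Float2 minBB, Float2 maxBB) {
    // floor(v - 0.5) counts the pixel centers at or below v, up to a constant.
    // If that count is the same for the min and max edge, no center lies
    // in the half-open interval (min, max] along that axis.
    const bool missesCenterX = std::floor(minBB.x - 0.5f) == std::floor(maxBB.x - 0.5f);
    const bool missesCenterY = std::floor(minBB.y - 0.5f) == std::floor(maxBB.y - 0.5f);
    return missesCenterX || missesCenterY;
}
```

Missing the centers along one axis is enough to cull, because a covered pixel needs its center inside the box along both axes. This sketch treats a center that lies exactly on the minimum edge as missed; a production version would need to match the fill rules of the rasterizer.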

Multi-View Triangle Removal

Triangle removal comes at the cost of loading the index and vertex data for every triangle, transforming the vertices, and then, later on, appending the triangle data to the filtered index buffer. The cost of accessing the triangle data appears to be higher than the cost of running the visibility tests.
As long as the number of triangles in the scene is high, this cost should be offset by the gains on modern GPUs.
One way to amortize this cost even more is to remove invisible triangles for several views -like a main camera view, a shadow map view, reflective shadow map view, etc.- at the same time.
That means if a triangle is visible in the main camera view but not in the shadow map view, it is considered visible in both views. In other words, the remaining set of visible triangles will be the least common denominator between all views; reducing the effectiveness of triangle removal.
Although the overall number of triangles that are removed with a multi-view triangle removal stage is smaller compared to just removing triangles for each view separately, huge performance gains are achieved by just loading the triangle data only once in that case.
Here is the source code in triangle_filtering.comp.fsl that executes FilterTriangle() for several views:

for (uint i = 0; i < NUM_CULLING_VIEWPORTS; ++i)
{
    float4x4 worldViewProjection = uniforms.transform[i].mvp;

    float4 vertices[3] =
    {
        mul(worldViewProjection, vert[0]),
        mul(worldViewProjection, vert[1]),
        mul(worldViewProjection, vert[2])
    };

    CullingViewPort viewport = uniforms.cullingViewports[i];

    cull[i] = FilterTriangle(indices, vertices, !twoSided,
                             viewport.windowSize, viewport.sampleCount);

    if (!cull[i])
        InterlockedAdd(workGroupIndexCount[i], 3, threadOutputSlot[i]);
}

Multi-View Triangle Removal - Results

The San Miguel scene used in the demo has around 8 million triangles. When the demo starts up in the default camera view (shown in image 6), after multi-view triangle removal, the filtered index buffer for the shadow map view indexes 1.843 million triangles, while the filtered index buffer for the main view indexes 2.321 million triangles.

Image 6 - Default start-up view of the Visibility Buffer demo

Triangle removal as described here or similar approaches are now used by every major developer in next-gen rendering systems and future graphics API design is picking up this idea and might improve geometry handling more.

Draw Call Compaction

The async compute shader for Triangle Filtering runs on batches of 256 triangles as described above. This stage removes all non-visible triangles by appending only visible triangles to the “filtered index buffer”.

This might lead to a situation where the removal of triangles ends up creating an empty draw call:

Batch0 - start index: 0 | num of indices: 12
Batch1 - start index: 12 | num of indices: 256
Batch2 - start index: 268 | num of indices: 120
Batch3 - start index: 388 | num of indices: 0 (empty batch)

In this list, Batch3 ends up being empty.

These "holes" impact performance since the GPU command processor has to do all the work setting up data for that draw call, which is wasted work if it is empty.

To fix the "holes", there is another pass called batch compaction in the shader batch_compaction.comp.fsl.

This shader removes empty draw calls and aligns the remaining ones so that the ExecuteIndirect call is efficient.

Image 7 shows the flow from triangles that are removed with culling tests to draw calls that need to be compacted.

Image 7 - Draw Call Compaction

The batch compaction compute shader checks if the draw call is empty and then removes those calls by filling a new draw argument buffer with only usable draw call data. This new draw argument buffer is later used by ExecuteIndirect.

This same shader also fills a per draw indirect material buffer which holds the material index for each draw call. It also determines the overall number of draw calls that will be passed as the final draw counter to the ExecuteIndirect call.
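The compaction step can be pictured with a serial C++ sketch. This is only an illustration of the data flow: the real pass runs as a compute shader, and the `DrawBatch` and `CompactedDraws` structs here are hypothetical and much simpler than real indirect draw arguments.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified draw record; real indirect arguments carry more fields.
struct DrawBatch {
    uint32_t startIndex;
    uint32_t indexCount;
    uint32_t materialIndex;
};

struct CompactedDraws {
    std::vector<DrawBatch> drawArgs;        // only non-empty draws, tightly packed
    std::vector<uint32_t> materialPerDraw;  // material index for each remaining draw
    uint32_t drawCount = 0;                 // final draw counter
};

CompactedDraws compactDrawCalls(const std::vector<DrawBatch>& batches) {
    CompactedDraws out;
    for (const DrawBatch& batch : batches) {
        if (batch.indexCount == 0) continue;  // skip the "holes"
        out.drawArgs.push_back(batch);
        out.materialPerDraw.push_back(batch.materialIndex);
        ++out.drawCount;
    }
    return out;
}
```

Applied to the four batches listed earlier, this keeps Batch0 to Batch2 and reports a draw count of three. A GPU version has to produce the same packed output in parallel, for example with an atomic counter that hands out output slots.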

On a high-level view, the Triangle Visibility Buffer rendering system went through the following stages so far:

[CPU] Early discard geometry not visible from any view using cluster culling

[CS] Generate N index and N ExecuteIndirect buffers by culling and filtering triangles against the N views (one triangle per compute shader thread)

[CS] Draw call compaction

For each view i, use the i-th index buffer and the i-th indirect argument buffer

With all the draw data optimized for usage, the next stage is filling the actual Visibility Buffer with ExecuteIndirect.

Filling the Visibility Buffer - ExecuteIndirect

The Triangle Visibility Buffer will hold indices into triangle data in an 8:8:8:8 render target similar to [Burns][Schied]. The index consists of a packed 32-bit value:

1-bit Alpha-Masked
In the demo, one bit holds information on whether the geometry requires alpha masking or not. The PC requires a dedicated code path for each with its own ExecuteIndirect.
8-bit drawID - indirect draw call id
An 8-bit value represents the id of the draw call to which the triangle belongs. In this implementation, there is space for 256 draw calls.
23-bit triangleID
A 23-bit value holds an id that describes the offset of a triangle inside a draw call. In other words, it is relative to the drawID.
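The three fields add up to exactly 32 bits (1 + 8 + 23). A minimal C++ sketch of packing and unpacking such a value follows. The bit order is an assumption made for this example (alpha mask in the top bit, drawID in the next 8 bits, triangleID in the low 23 bits), and the function and struct names are hypothetical; the shaders in the repository are the reference for the actual layout.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kTriangleBits = 23;
constexpr uint32_t kDrawBits = 8;
constexpr uint32_t kTriangleMask = (1u << kTriangleBits) - 1u;
constexpr uint32_t kDrawMask = (1u << kDrawBits) - 1u;

struct VisibilityID {
    bool alphaMasked;
    uint32_t drawID;      // 0..255
    uint32_t triangleID;  // offset of the triangle inside its draw call
};

// Assumed layout: [31] alpha mask | [30:23] drawID | [22:0] triangleID
uint32_t packVisibility(VisibilityID id) {
    assert(id.drawID <= kDrawMask);
    assert(id.triangleID <= kTriangleMask);
    return (static_cast<uint32_t>(id.alphaMasked) << 31) |
           (id.drawID << kTriangleBits) |
           id.triangleID;
}

VisibilityID unpackVisibility(uint32_t packed) {
    VisibilityID id;
    id.alphaMasked = (packed >> 31) != 0u;
    id.drawID = (packed >> kTriangleBits) & kDrawMask;
    id.triangleID = packed & kTriangleMask;
    return id;
}
```

With 23 bits, a single draw call can address a little over 8 million triangles, and the 8-bit drawID is what limits this layout to 256 draw calls.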

The render target holding this data is filled with ExecuteIndirect calls in parallel with the Depth Buffer.
All ExecuteIndirect calls read vertex buffers, index buffers, and a material buffer, that is used to apply various materials.
There are four different vertex buffers:

Position
Texture coordinates
Normals
Tangents

Separating vertex data into four buffers (also called non-interleaved vertex data) turned out to be more efficient due to position and texture coordinates being used more often than normals and tangents.
Looking at the separate stages:

Triangle filtering uses

Position

Filling the Visibility Buffer uses

Position
Texture coordinates for alpha testing

Shading uses

Position
Texture coordinates
Normals
Tangents

The ExecuteIndirect calls also expect index buffers that are used to index into the vertex buffers. This demo is using six “filtered” index buffers that were generated during triangle removal by appending only visible triangles to them. There are two sets, one for the camera view and one for the shadow map view, each with three index buffers for triple buffering the swap chain. The triple buffering was necessary for the async compute shader used in triangle removal.
The ExecuteIndirect calls also expect “filtered” indirect argument buffers that were generated during the draw call compaction stage after triangle removal.
The last buffer fed to the ExecuteIndirect calls is the texture id or material buffer (also generated during draw call compaction), which is used to represent a wide range of materials in the scene.
All the source code can be found in drawVisibilityBufferPass() in Visibility_Buffer.cpp and in visibilitybuffer_pass.frag.fsl.

In the San Miguel test scene, the numbers of indirect draw calls in the four ExecuteIndirect calls are:

Shadow opaque: 163
Shadow alpha masked: 50
Main view opaque: 152
Main view alpha masked: 50

As soon as these four ExecuteIndirect calls have finished, the Visibility Buffer and the Depth buffer are filled with one layer of triangles and one layer of pixels. In other words, overdraw of triangles and pixels is removed for opaque geometry.
The demo holds implementations for the described Visibility Buffer approach and a G-Buffer based Deferred Shading approach. The way the G-Buffer is filled resembles the way the Visibility Buffer is filled. The main difference is the memory usage patterns.

Memory Usage - Visibility Buffer vs. G-Buffer

Memory bandwidth is one of the more limiting factors for the performance of games, especially on lower-end platforms or on platforms that need to support 4k and higher resolutions.
The increasing size of G-Buffers during the last 10 years makes the commonly used Deferred Shading techniques more bandwidth-hungry.
Games use vertex and index buffers, other data like textures, draw arguments, uniforms, descriptors, and then render targets. Render targets scale with screen size and for larger screen resolutions represent a very large part of the memory occupied during rendering.
One of the advantages of the Visibility Buffer is that it fits into two 32-bit render targets (Triangle Visibility in 32-bit and depth visibility in 32-bit as well). The following text will compare the memory usage of the demo implementation of the Visibility Buffer and the G-Buffer implementation.
The usage of vertex and index buffers to feed the ExecuteIndirect calls is the same in the Visibility Buffer and the G-Buffer implementation as shown in Image 8:

Image 8 - Memory usage of the Vertex and Index Buffers

Additionally, there is data used for textures (roughly 21 MB), draw arguments, uniforms, descriptors etc. (roughly 2 MB). From a memory perspective, the most interesting memory is the one that is used for screen-space render targets. Image 9 shows the render target memory occupied with a resolution of 1080p and various MSAA settings for the Visibility buffer:

Image 9 - Visibility Buffer Memory at 1080p

Image 10 shows the render target memory occupied in a resolution of 1080p and various MSAA settings for a G-Buffer:

Image 10 - G-Buffer Memory at 1080p

Comparing the 1080p memory numbers, the G-Buffer with 2x and 4x MSAA is more than twice the size of the Visibility Buffer, as expected when going from two 32-bit render targets to five.
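
The raw arithmetic behind these comparisons is simple enough to sketch in C++. The helper below only multiplies width, height, bytes per pixel, target count and sample count. It ignores driver padding, alignment and any additional targets, so its results are rough estimates and will not necessarily match the numbers in the images. The function names are made up for this example:

```cpp
#include <cassert>
#include <cstdint>

// Bytes occupied by `targetCount` render targets with `bytesPerPixel` each.
// Ignores driver padding, alignment and compression metadata.
inline uint64_t renderTargetBytes(uint32_t width, uint32_t height,
                                  uint32_t bytesPerPixel, uint32_t targetCount,
                                  uint32_t msaaSamples) {
    assert(msaaSamples >= 1);
    return static_cast<uint64_t>(width) * height * bytesPerPixel *
           targetCount * msaaSamples;
}

inline double toMiB(uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// Two 32-bit targets (triangle visibility + depth) vs. five 32-bit targets.
constexpr uint32_t kVisibilityBufferTargets = 2;
constexpr uint32_t kGBufferTargets          = 5;
```

For example, renderTargetBytes(1920, 1080, 4, kVisibilityBufferTargets, 1) gives 16,588,800 bytes (about 15.8 MiB), while the same call with kGBufferTargets gives 41,472,000 bytes (about 39.6 MiB). At 3840x2160 both numbers grow by a factor of four.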

With a monitor or TV supporting 4k (3840x2160) the delta between the G-Buffer and the Visibility Buffer becomes bigger as shown in Images 11 and 12:

Image 11 - Visibility Buffer Render Target memory at 4k

Image 12 - G-Buffer Render Target memory at 4k

The numbers provided are only estimates on PC because the driver and the way memory is fragmented might change how much one render target actually occupies.
These numbers show how filling and reading a G-Buffer with large screen resolutions for Deferred Shading can become a memory bandwidth bottleneck, depending on the memory bandwidth of the used GPU. This becomes even more dramatic with 5k and 8k displays.
In other words: one motivation to implement a Visibility Buffer-like approach is to reduce memory bandwidth on high-res displays on platforms that do not have much high-speed memory, like hardware-tiled platforms or some console platforms.

Shading

After the Visibility Buffer is filled with one layer of triangles for the opaque pass, and the depth buffer is filled with one layer of pixels, the scene can be shaded. To prepare for shading the scene, a list of lights per screen-space tile is generated upfront (Tiled Light List).

Tiled Light List

To deal with a large number of lights, the demo implementation splits the screen-space into tiles and identifies lights that need to be rendered in those tiles. In the actual shading pass, this light list will be used to do one screen-space lighting pass for all light sources for opaque and transparent objects.
To generate this list, a compute shader runs on 64 lights per tile. It compares the screen-space extent of each light's bounding volume in the x and y-direction with the extent of the tile in the x and y-direction. If the light overlaps the tile, the shader adds the light to the light cluster and increases the light count for that cluster. There is also an early out for lights that are behind the camera.
The source code for generating the list of lights in those tiles can be found in cluster_lights.comp.fsl.
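
As a rough illustration of the binning idea, here is a CPU-side C++ sketch. It is not a port of cluster_lights.comp.fsl. The light representation (a bounding circle in normalized device coordinates plus a view-space depth), the "depth > 0 is in front of the camera" convention and all names are assumptions made for this example:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical light description: bounding circle in normalized device
// coordinates ([-1, 1] on both axes) plus view-space depth of the center.
struct ScreenLight {
    float centerX, centerY;
    float radius;
    float viewDepth; // > 0 means in front of the camera (assumed convention)
};

struct TiledLightList {
    uint32_t tilesX = 0, tilesY = 0;
    std::vector<std::vector<uint32_t>> lightsPerTile; // light indices per tile

    // NDC -> [0, 1] -> tile index, clamped to the grid.
    const std::vector<uint32_t>& tileAt(float ndcX, float ndcY) const {
        const float u = std::min(std::max(ndcX * 0.5f + 0.5f, 0.0f), 1.0f);
        const float v = std::min(std::max(ndcY * 0.5f + 0.5f, 0.0f), 1.0f);
        const uint32_t tx = std::min(static_cast<uint32_t>(u * tilesX), tilesX - 1);
        const uint32_t ty = std::min(static_cast<uint32_t>(v * tilesY), tilesY - 1);
        return lightsPerTile[ty * tilesX + tx];
    }
};

inline TiledLightList buildTiledLightList(const std::vector<ScreenLight>& lights,
                                          uint32_t tilesX, uint32_t tilesY) {
    assert(tilesX > 0 && tilesY > 0);
    TiledLightList list;
    list.tilesX = tilesX;
    list.tilesY = tilesY;
    list.lightsPerTile.resize(static_cast<std::size_t>(tilesX) * tilesY);
    const float tileW = 2.0f / tilesX;
    const float tileH = 2.0f / tilesY;

    for (uint32_t i = 0; i < lights.size(); ++i) {
        const ScreenLight& l = lights[i];
        if (l.viewDepth <= 0.0f) continue; // simplified early out: behind the camera
        for (uint32_t ty = 0; ty < tilesY; ++ty) {
            for (uint32_t tx = 0; tx < tilesX; ++tx) {
                const float minX = -1.0f + tx * tileW, maxX = minX + tileW;
                const float minY = -1.0f + ty * tileH, maxY = minY + tileH;
                // Conservative test: box around the light vs. tile rectangle.
                const bool overlapsX = l.centerX + l.radius >= minX && l.centerX - l.radius <= maxX;
                const bool overlapsY = l.centerY + l.radius >= minY && l.centerY - l.radius <= maxY;
                if (overlapsX && overlapsY)
                    list.lightsPerTile[ty * tilesX + tx].push_back(i);
            }
        }
    }
    return list;
}
```

The per-tile vectors stand in for the light count and light index buffers that a GPU version would fill. The overlap test is deliberately conservative, so a tile may list a light that only touches its corner region.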

Forward++

Because the lighting technique uses the Visibility Buffer with its one layer of optimized triangle data in one screen-space pass, we call it Forward++ compared to Forward+ that would use several draw calls.

Image 13 - Shading the Visibility Buffer with Forward++

Image 13 shows the Visibility Buffer and the Depth buffer at the top. The various vertex and index buffers used for shading are on the right. On the left is the tiled light list that is used to apply a large number of lights per tile. For transparent objects, we still have to use traditional Forward+ by sorting draw calls back-to-front before we execute them.

On a high level, the shading algorithm goes through the following steps:

Get drawID/triangleID at screen-space pixel position
Load data for the 3 vertices from the IB and then the VB
Compute the partial derivatives of the barycentric coordinates – triangle gradients
Interpolate vertex attributes at pixel position using gradients
Calculate Directional light contribution (in the demo either Blinn-Phong or PBR)
Add point light contributions by going through the tiles of the tiled light list

The source code for applying the lights is in visibilityBuffer_shade.frag.fsl.

To calculate the partial derivatives, we are using the following equation from [Schied] in Appendix A Equation (4):

Equation 1 - Partial Derivatives

The implementation of this equation looks like this:

// Computes the partial derivatives of a triangle from the projected
// screen space vertices
DerivativesOutput computePartialDerivatives(float2 v[3])
{
    DerivativesOutput output;
    float d = 1.0 / determinant(float2x2(v[2] - v[1], v[0] - v[1]));
    output.db_dx = float3(v[1].y - v[2].y, v[2].y - v[0].y, v[0].y - v[1].y) * d;
    output.db_dy = float3(v[2].x - v[1].x, v[0].x - v[2].x, v[1].x - v[0].x) * d;
    return output;
}

The partial derivatives in this code are calculated without intrinsics to preserve as much precision as possible.
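
For experimenting outside of a shader, the function can be ported to plain C++. The sketch below mirrors computePartialDerivatives() and adds two small helpers that use the gradients to get barycentric weights at a pixel and interpolate a scalar attribute. The helpers and the Float2/Float3 types are made up for this example, and the interpolation is purely affine in screen-space, so it leaves out perspective correction:

```cpp
#include <cassert>
#include <cmath>

struct Float2 { float x, y; };
struct Float3 { float x, y, z; };

struct DerivativesOutput {
    Float3 db_dx; // d(barycentrics) / dx
    Float3 db_dy; // d(barycentrics) / dy
};

// CPU port of computePartialDerivatives().
inline DerivativesOutput computePartialDerivatives(const Float2 v[3]) {
    // 2x2 determinant with rows (v2 - v1) and (v0 - v1).
    const float det = (v[2].x - v[1].x) * (v[0].y - v[1].y) -
                      (v[2].y - v[1].y) * (v[0].x - v[1].x);
    assert(det != 0.0f && "degenerate triangle");
    const float d = 1.0f / det;
    DerivativesOutput out;
    out.db_dx = { (v[1].y - v[2].y) * d, (v[2].y - v[0].y) * d, (v[0].y - v[1].y) * d };
    out.db_dy = { (v[2].x - v[1].x) * d, (v[0].x - v[2].x) * d, (v[1].x - v[0].x) * d };
    return out;
}

// Barycentric weights at pixel p, stepped from vertex 0 where they are (1, 0, 0).
inline Float3 barycentricsAt(const Float2 v[3], const DerivativesOutput& deriv, Float2 p) {
    const float dx = p.x - v[0].x;
    const float dy = p.y - v[0].y;
    return { 1.0f + deriv.db_dx.x * dx + deriv.db_dy.x * dy,
                    deriv.db_dx.y * dx + deriv.db_dy.y * dy,
                    deriv.db_dx.z * dx + deriv.db_dy.z * dy };
}

// Affine interpolation of one scalar attribute (no perspective correction).
inline float interpolate(Float3 bary, float a0, float a1, float a2) {
    return bary.x * a0 + bary.y * a1 + bary.z * a2;
}
```

A quick sanity check is that the three components of each gradient sum to zero, since the barycentric weights always sum to one.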
The actual shading code is rather straightforward. The directional light is applied first and the point lights are applied later in a for loop depending on their visibility in the screen tiles:

…

// directional light
shadedColor = calculateIllumination(normal, uniforms.camPos.xyz, uniforms.esmControl,
    uniforms.lightDir.xyz, isTwoSided, posLS, position, shadowMap,
    diffuseColor.xyz, specularData.xyz, depthSampler);

// point lights
// Find the light cluster for the current pixel
uint2 clusterCoords = uint2(floor((input.screenPos * 0.5 + 0.5) *
    float2(LIGHT_CLUSTER_WIDTH, LIGHT_CLUSTER_HEIGHT)));

uint numLightsInCluster = lightClustersCount.Load(LIGHT_CLUSTER_COUNT_POS(clusterCoords.x,
    clusterCoords.y) * 4);

// Accumulate light contributions
for (uint i = 0; i < numLightsInCluster; i++)
{
    uint lightId = lightClusters.Load(LIGHT_CLUSTER_DATA_POS(i, clusterCoords.x,
        clusterCoords.y) * 4);

    shadedColor += pointLightShade(lights[lightId].position, lights[lightId].color,
        uniforms.camPos.xyz, position, normal,
        specularData, isTwoSided);
}

This code and the setup will likely change in future iterations of the demo. Any Ray Tracing code might benefit from the existence of partial derivatives and the fact that the visibility of triangles is optimized in the Visibility Buffer.

Visibility Buffer - Benefits

Comparing the Visibility Buffer to a Deferred Shading system with a large G-Buffer shows the following benefits.

Memory Bandwidth

Due to the smaller render target memory footprint, the Visibility Buffer offers memory bandwidth benefits compared to a G-Buffer. This becomes evident in scenarios where the screen resolution is high or where the amount of fast memory is so limited that only two 32-bit render targets or even a tiled region of render targets fit.

Memory Access Patterns

When shading happens, triangle data is fetched from the filtered index buffer through the Visibility Buffer. The actual fetch of data from the index/vertex buffers happens similarly to a regular draw call, but only once and continuously in screen-space. In other words, apart from the indirection through the Visibility Buffer, the memory access of index and vertex buffers follows the “optimal” access pattern that the architects of GPUs had in mind. Compared to a regular forward renderer, this happens only once for opaque objects in screen-space and not for several draw calls.
We see L2 cache hit rates of 99% for textures, vertex, and index buffers. Therefore lighting the triangles appears to be fast.
To apply a light in a G-Buffer, a larger memory area has to be accessed due to the more redundant nature of screen-space data.

These two benefits are underlined by the performance measurements shown below.

Material Variety

The Visibility Buffer can represent a much wider range of materials because material parameters do not have to be stored per-pixel in a G-Buffer. All the lessons learned from using materials in forward renderers still apply, extended by the idea that the Visibility Buffer uses bindless texture arrays. Other than that, it should be the same.

Miscellaneous

There are several questions that usually come up in discussions about the Triangle Visibility Buffer implementation.

Why didn’t we implement this earlier?

What is described here was not a straightforward process of implementing one paper. We started out with [Schied] in September 2015. Christoph Schied came to our office and implemented his approach in OpenGL in the rendering framework we used at that point in time. We then simplified everything over the following 2 ½ years to the point where our approach turned into the approach taken by [Burns]. Compared to [Burns], the triangles stored in the Visibility Buffer come from an optimal “massaged” workload set produced by the triangle removal and draw compaction steps, with ExecuteIndirect reducing CPU overhead. Because this requires a compute shader, it was not possible at the time of [Burns].

After the Visibility Buffer is filled with one layer of triangles and the depth buffer holds one layer of pixels, the now one-time screen-space shading can be executed faster compared to Deferred Shading and a Forward Renderer due to better memory access patterns.

[Burns] couldn't use compute shaders and therefore a tiled light list was not possible.

How often do you have to skin animated objects?

There are three stages that transform vertices: triangle removal, filling the Visibility Buffer, and shading. After the triangle removal stage, the transformation hopefully only has to happen on less than half of the triangles that went into triangle removal.
To reduce the number of times that a triangle has to be transformed, in a future iteration of the demo application, pre-transformation of triangles will be implemented.

How about Deferred Decals?

If you still use a Decal system it might be time to switch to an async compute-driven texture synthesis system. Other than that the equivalent of Deferred Decals can be implemented after the Visibility Buffer fill, fetching triangle and normal data from the Visibility Buffer and applying the end result in the back-buffer similarly to a Deferred Decal system.

Performance Numbers

Over the years, we collected performance numbers on various platforms ranging from console platforms to macOS and now PC with DirectX 12 and Vulkan. The Visibility Buffer demo allows switching between a Deferred Shading implementation with a G-Buffer that resembles what is used in games and the actual Visibility Buffer implementation.
Both are similar when it comes to how the data is set up to be rendered into the Visibility Buffer / G-Buffer. So both use the ExecuteIndirect setup described above on all platforms. The main difference is the usage of the G-Buffer.
Below are performance numbers for the DirectX 12 implementation running at 4k from an older version of the codebase.

Image 14 - Visibility Buffer Performance Numbers

Image 15 - Deferred Shading Performance Numbers

The column that is named “Culling” shows the performance cost of triangle culling and filtering. Most of the other columns are self-explanatory.
With increasing screen resolution, the difference in performance between a G-Buffer and the Visibility Buffer becomes apparent. The difference translates also to console platforms in 1080p and 4k resolutions.

Future

For future iterations of the Visibility Buffer, we are looking at Physically Based Materials, Ray Tracing and Object-Space Shading. In case we find any noteworthy results, they will be shared in another blog post.

Credits

Like all the work at our company, a research project like this for such a long time is touched by a large number of people. In no particular order, there was Leroy Sikkes, Jesus Gumbau, Thomas Zeng, Max Oomen, Jordan Logan, Marijn Tamis, David Srour, Manas Kulkarni, Volkan Ilbeyli, Andreas Valencia Telez, Eloy Ribera, Antoine Micaelian who worked at one point or another on this project. In case I forgot someone, I will add the person … let me know :-)

Update: it is the year 2021 and this list can be extended by about 30 more people. Everyone who ever worked at our company worked on this in one way or another since 2015 ... we also need to thank more companies for supporting this research: Apple, AMD, INTEL, Google, and I am probably forgetting a few. They spun off projects with us over the years. Thanks for all the support!

We are using GeometryFX and the Vulkan Memory Allocator from AMD and many other open-source libraries. We want to thank all the open-source contributors for sharing their code and knowledge. Without these contributions, writing your own game engine with a framework like The Forge wouldn’t be as easy. We are hoping that this spirit lives on and others are encouraged to do the same.

References

[Burns] Christopher A. Burns, Warren A. Hunt, “The Visibility Buffer: A Cache-Friendly Approach to Deferred Shading”, Journal of Computer Graphics Techniques (JCGT) 2:2 (2013), 55-69. Available online at http://jcgt.org/published/0002/02/04
[Chajdas] Matthaeus Chajdas, “GeometryFX”, http://gpuopen.com/gaming-product/geometryfx/
[Engel2009] Wolfgang Engel, “Light Pre-Pass”, “Advances in Real-Time Rendering in 3D Graphics and Games”, SIGGRAPH 2009, http://halo.bungie.net/news/content.aspx?link=Siggraph_09
[Lagarde] Sebastien Lagarde, Charles de Rousiers, “Moving Frostbite to Physically Based Rendering”, Course notes SIGGRAPH 2014
[Olano] Marc Olano, Trey Greer, “Triangle Scan Conversion using 2D Homogeneous Coordinates”, https://www.csee.umbc.edu/~olano/papers/2dh-tri/
[Schied2015] Christoph Schied, Carsten Dachsbacher, “Deferred Attribute Interpolation for Memory-Efficient Deferred Shading”, http://cg.ivd.kit.edu/publications/2015/dais/DAIS.pdf
[Schied2016] Christoph Schied, Carsten Dachsbacher, “Deferred Attribute Interpolation Shading”, GPU Pro 7, CRC Press
[Wihlidal] Graham Wihlidal, “Optimizing the Graphics Pipeline with Compute”, GDC 2016, http://www.frostbite.com/2016/03/optimizing-the-graphics-pipeline-with-compute/

HDR10 - TV setup

2017-04-07T17:20:00.000-07:00

We have a large number of HDR (High-Dynamic Range) TVs in the office due to our work on HDR standards. We run HDR-capable content on those TVs frequently.

I was recently showing one of our non-HDR demos at a conference. The conference organizer was very nice and they provided an HDR10 TV to us. I was grateful for that because that meant we didn't have to ship a large and heavy TV over a large distance.
Usually, I am not the one who sets up our TVs. One of my co-workers with more experience is dealing with this type of work.

The demo I was going to show didn't have any special HDR treatment. It is not supposed to showcase the advantages of a higher dynamic brightness range. So all the art assets are LDR and there was no effort made in making it look beautiful on an HDR TV. It was demoed on many LDR TVs and monitors before.

I was stunned to see how ugly this demo looked on this particular HDR10-capable TV. Up until then it hadn't occurred to me that we would have to make changes to the art assets or add a tone mapper and tune it (the demo didn't have one) to make an LDR demo look good on an HDR10-capable TV. My assumption was that the LDR demo should still look good on an HDR TV. After all, LDR is a subset of HDR, right?

Also, as a programmer, I consider going from LDR -> HDR a solved problem ... at least it should be documented well enough, and therefore I didn't expect any challenge from just outputting our little demo, which had already been used on many LDR monitors and TVs, on an HDR10 TV.

So after my initial shock, a friendly person suggested going to the following website to tune the TV:

http://www.rtings.com/tv/reviews/vizio/m-series-2015/settings

At that point, it occurred to me that an HDR10-capable TV has dozens of sliders in several sub-menus. I had a hard time understanding what all these sliders do and how they interact. There is no way an end-user will go through them and understand even 5% of them.

How do we make sure that our games run on a wide range of HDR10 capable TVs consistently? Will every game come with a handbook explaining the best settings for a few dozen TVs? We can not possibly assume any end-user will go through the exercise of figuring this out. Previous experiences with just adjusting the gamma value indicate a low "adoption rate" of menu options.

Will we give recommendations like we do for PC Graphics Cards now, saying this game is best used with this display and expect users to upgrade their displays for our game? What happens if two games recommend different displays, will users need to have different displays in their main living room ... obviously, I am exaggerating.

For gamma, we invented our own gamma calibration test screens that worked well. Is there a way to do this for color as well? Maybe color wheels? This way we can guide users that are inclined to work their way through the menu options to a more optimal image?

Has anyone already thought this through?

GDCE 2016 - The filtered and culled Visibility Buffer

2016-08-22T15:48:00.003-07:00

Here is the executive summary: we built a rendering system that

Cluster culls and filters triangles for different views like main view, shadow view, reflection view, GI view etc. at the same time
The optimized triangles are used to fill a screen-space Visibility Buffer or more Visibility Buffers for more views
We then render lights, shadows, bounce lights with the optimized geometry based on visibility
We can differentiate between visibility of geometry and shading frequency
We can light per triangle or in so-called object space

Please download it from here http://www.confettispecialfx.com/gdce-2016-the-filtered-and-culled-visibility-buffer-2/

Intel Blog: Performance Considerations for Resource Binding in Microsoft DirectX* 12

2015-10-22T14:49:00.005-07:00

I wrote another blog entry for Intel.

Performance Considerations for Resource Binding in Microsoft DirectX* 12

Implementation of the GPU Pro 5 Screen-Space Glossy Reflection Algorithm

2015-08-12T16:15:00.001-07:00

Someone (can't find the name on the website) provided an implementation of a GPU Pro 5 article:

http://roar11.com/2015/07/screen-space-glossy-reflections/

Pretty cool!

MVP Award

2015-07-01T10:12:00.000-07:00

This year I was honored with an MVP award. This is the tenth time in a row and I am very excited about this. I would like to thank everyone for supporting my nominations for the last 10 years.

Here is my MVP page:

http://mvp.microsoft.com/en-us/mvp/Wolfgang%20Engel-35704

A lot of things that I do during the year do not find their way onto this blog. Most of the time I am too busy doing these things, which leaves me with not much time to blog about them. I also consider this blog more of an offer to provide advice or insights into things I am working on in my spare time (outside of Confetti). On top of that, with Confetti growing more and more over the last six-plus years, my programming time, including spare time, decreased.

In general I do not give myself much time to reflect on what happened during those ten years as an MVP. I am still trying to understand the dimension of being active in a highly volatile industry like the game industry for 10 years. Obviously I have been in the industry much longer.

10 years ago a new console generation launched with the XBOX 360 / PS3. We considered that launch a major event because these two platforms together with the PC were considered the main gaming devices for the next seven years. Only two years later, mobile games started to take off after Steve Jobs changed his mind about not supporting native programming on the iOS devices.
Today we have devices like the iPad Air 2 and the NVIDIA Shield that offer performance close to the XBOX 360 / PS3 and the big console manufacturers have a serious challenge in competing with the many mobile devices that people already have in their homes. It became so easy for companies to launch their own consoles that now many companies are launching mini consoles that use more advanced mobile parts.

The production models in the industry are rapidly changing. Similar to the movie industry, parts of the industry move away from the monolithic model of having large dev-teams on games to more flexible strike teams, where they hire companies like Confetti to come in and take care of graphics and tools instead of having a group of people permanently on staff for those tasks for a long time.
This is an exciting development for Confetti and I feel like we are in the middle of it.
It will be interesting where all this will go ... one thing I know is that we will become better every year. We will always strive to make the next year better than the previous year, improve efficiency, learn more.

With the companies that haven't adjusted to the strike team model, there is the unfortunate development over the last 10 years that they keep flooding the news with large layoffs. They send out press releases saying that they had to reduce workforce for reasons like "aligning" expectations, budget, lack of success etc. Many of those press releases express a snide view on the treatment of humans that reminds one of the darker time of slavery.

One more unfortunate development with sharing information over the last 10 years is that most of the information that is shared at conferences now has software patents attached to it. So in case someone wants to implement it (obviously without knowing it: every employee is told that they are not allowed to read patent descriptions), his / her company might have to pay for it in the future. The system of freely sharing information and helping other developers to succeed with the difficult technical implications was turned upside down in favor of companies with large law units. The willingness of developers to help their peers is used by companies to secure future economic advantages.
On top of that, middleware companies like Unity and others have a hard time open-sourcing their engines because they are concerned that they violate various patents and therefore would run into huge economic risks when they share source code.

Apart from the "strike team" model, the most exciting development is the new breed of developers that adjusted to the new economic pressures of the App store model and make a living from new and innovative games. We had the pleasure of working with some of these and it is an awesome experience to feel the creative and positive energy that is flowing in those companies. They remind me of the development in the middle of the 90's, when what we now call the game industry booted into "big" games that reach millions of people. This new generation now reaches hundreds of millions of people. You could say this is the third wave of game developers, with the first wave being the developers of the 80's and the second wave the developers of the 90's.

Link Collection

2015-06-14T13:17:00.002-07:00

Here are some links that show some interesting progress:

HUSL is a human-friendly alternative to HSL

The Lost Art of C Structure Packing

A Picture To Show You Clearly The Effects of Aperture, Shutter Speed and ISO On Images

Constant Buffers without Constant Pain

What's New in CPUs Since the 80s and How Does It Affect Programmers?

JPS+: Over 100x Faster than A*

Adaptive Depth Bias for Shadow Maps

Multi-GPU Game Engine

2015-05-31T20:23:00.000-07:00

Many high-end rendering solutions for, for example, battlefield simulations can now utilize hardware solutions with multiple consumer GPUs. The idea is to split up computational power between 4 - 8 GPUs to increase the level of realism as much as possible.
Now with more modern APIs like DirectX 12 and probably Vulkan, and before that CUDA, splitting up the rendering pipeline can happen in the following way:
- GPU0 - fills up the G-Buffer after a Z pre-pass
- GPU1 - Renders Deferred Lights and Shadows
- GPU2 - Renders Particles and Vegetation
- GPU3 - Renders Screen-Space Materials like skin etc. and PostFX

Now you can use the result of GPU0 and feed it to GPU1 and then feed it to GPU2 and so on. All this will run in parallel but will introduce two or three frames of lag (depending on how you light Particles and Vegetation). As long as the system renders at 60 fps or 120 fps this will not be very noticeable (obviously one of the targets is to have a high framerate to make animations look smooth and then also to have 4K resolution rendering). GPU4 and higher can work on Physics, AI and other things.
There is also the opportunity to spread out G-Buffer rendering over several GPUs: one GPU does the Z pre-pass, another fills up diffuse, normal and probably some geometry data to identify different objects later or store their edges, and another GPU fills up the terrain data. Vegetation can be rendered on a dedicated GPU etc.
On the CPU side the rule of thumb is that at least 2 cores are needed for one GPU. It is probably better to go for three or four. So a four-GPU machine should have 8 - 16 CPU cores and an eight-GPU machine 16 - 32 CPU cores, which might be split between several physical CPUs. We need at least 2x as much CPU RAM as the GPUs have RAM, so if four GPUs each have 2 GB, we need at least 16 GB RAM; if we have eight GPUs, we need at least 32 GB RAM etc.
A 4K resolution consists of 3840 × 2160 pixels and with four render targets, each 32-bit per pixel (8:8:8:8 or 11:11:10), it will occupy roughly 126.56 MB. This number goes up with 4x or 8x MSAA and maybe super-sampling. It is probably safe to assume that the G-Buffer might occupy between 500 MB and 1 GB.
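
The rules of thumb and the G-Buffer estimate above are easy to check with a few lines of C++. This is only a back-of-the-envelope sketch. The struct and function names are made up for this example, and the numbers ignore driver padding and alignment:

```cpp
#include <cassert>
#include <cstdint>

struct MachineSizing {
    uint32_t minCpuCores;
    uint32_t maxCpuCores;
    uint32_t minSystemRamGB;
};

// Rule of thumb from the text: 2 to 4 CPU cores per GPU and at least
// twice as much system RAM as the combined GPU memory.
inline MachineSizing sizeMachine(uint32_t gpuCount, uint32_t vramPerGpuGB) {
    assert(gpuCount > 0);
    return { gpuCount * 2u, gpuCount * 4u, gpuCount * vramPerGpuGB * 2u };
}

// G-Buffer size in MB (1 MB = 1024 * 1024 bytes) for `targetCount`
// render targets with 32 bits per pixel each.
inline double gBufferMB(uint32_t width, uint32_t height,
                        uint32_t targetCount, uint32_t msaaSamples) {
    assert(msaaSamples >= 1);
    const uint64_t bytes = static_cast<uint64_t>(width) * height * 4u *
                           targetCount * msaaSamples;
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

With four 32-bit targets at 3840 × 2160, gBufferMB() returns 126.5625 without MSAA, 506.25 with 4x MSAA and 1012.5 with 8x MSAA, which lines up with the 500 MB to 1 GB range. sizeMachine(4, 2) yields 8 to 16 cores and 16 GB RAM.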
Achieving a frametime of 8 - 16 ms means that even a high-end GPU will be quite busy filling up a G-Buffer this size. So thinking about splitting this between two GPUs might make sense.
A high-end PostFX pipeline is now < 5ms on medium-class GPUs but dedicating a whole high-end GPU means we can finally switch on the movie settings :-)
A GPU particle system can easily saturate a GPU with 16 ms ... especially if it is not rendering in a quarter size resolution.
For Lights and shadows it depends on the number of lights that should be applied. Caching all the shadow data in partially resident textures or cube maps or any other shadow map technique will hit the memory budget of this card substantially.

Note: I wrote this more than two years ago. At the time a G-Buffer was a valid solution for designing a rendering system. Now with the high-res displays it is not anymore.

V Buffer - Deferred Lighting Re-Thought

2015-05-27T10:56:00.000-07:00

After eight years I would like to go back to re-design the existing rendering systems, so that they are capable of running more efficiently on high-resolution devices and displaying more lights with attached shadows.

Let's first see where we are: the Light Pre-Pass was introduced in March 2008 on this blog. At this point I had it already running in one R* game for a while. It eventually shipped in a large number of games and also outside of R*. The S.T.A.L.K.E.R series and the games developed by Naughty Dog had at the time a similar approach. Since then a number of modifications were proposed.
One modification was to calculate lighting by tiling the G-Buffer, then sorting lights into those tiles and then execute each tile with its light. Johan Andersson covered a practical implementation in "DirectX 11 rendering in Battlefield 3" (http://www.slideshare.net/DICEStudio/directx-11-rendering-in-battlefield-3). Before Tiled-Deferred, lights were additively blended into a buffer, consuming memory bandwidth with each blit. The Tiled-Deferred approach reduced memory bandwidth consumption substantially by resolving all the lights in one tile.
The drawback of this approach is the higher minimum run-time cost. Sorting the lights into the tiles raised the "resting" workload even when only a few lights were rendered. Compared to the older approaches it didn't break even until one rendered a few dozen lights. Additionally as soon as lights had to be drawn with shadows, the memory bandwidth savings were negligible.
Newer approaches like "Clustered Deferred and Forward Shading" (http://www.cse.chalmers.se/~uffe/clustered_shading_preprint.pdf) by Ola Olsson et al. started solving the "light overdraw" problem in even more efficient ways. A practical implementation is shown on Emil Persson's website (http://www.humus.name/Articles/PracticalClusteredShading.pdf) in an example program.
Because transparency solutions with all the approaches mentioned above are inconsistent with the way opaque objects are handled, there was a group of people that wished to go back to forward rendering. Takahiro Harada described and refined an approach that he called Forward+ (http://www.slideshare.net/takahiroharada/forward-34779335). The tile-based handling of light sources was similar to the Tiled-Deferred approach. The advantage of having a consistent way of lighting transparent and opaque objects was bought by having to re-submit all potentially visible geometry several times.

Filling a G-Buffer or re-submitting geometry in case of Forward+ is expensive. For the Deferred Lighting implementations, the G-Buffer fill was also the stage where visibility of geometry was solved (there is also the option of making a Z Pre-Pass, which means geometry is submitted at least one more time).
With modern 4k displays and high-res devices like Tablets and smart phones, a G-Buffer is not a feasible solution anymore. When the Light Pre-Pass was developed a 1280x720 resolution was considered state of the art. Today 1080p is considered the minimum resolution, iOS and Android devices have resolutions several times this size and even modern PC monitors can have more than 4K resolution.
MSAA increases the size and therefore cost manifold.

Instead of rendering geometry into three or four render targets with overdraw (or re-submitting after the Z prepass), we need to find a way to store visibility data separately in a much smaller buffer, in a more efficient way.
In other words, if we could capture the full-screen visibility of geometry in as small a footprint as possible, we could significantly reduce the cost of geometry submission and pixel overdraw afterwards.

A first idea on how this could be done is described in the article "The Visibility Buffer: A Cache-Friendly Approach to Deferred Shading" by Christopher A. Burns et al. The article outlines the idea of storing per-triangle visibility data in a visibility buffer.

Introduction to Resource Binding in Microsoft DirectX* 12

2015-04-09T15:21:00.003-07:00

I spent some time writing an article that should explain resource binding in DirectX 12. When I looked at this for the first time, I had a tough time getting my head around resource binding ... so I am hoping this article makes it easier for others to understand. Let me know in the comments ...

https://software.intel.com/en-us/articles/introduction-to-resource-binding-in-microsoft-directx-12

Reloaded: Compute Shader Optimizations for AMD GPUs: Parallel Reduction

2015-01-12T15:50:00.004-08:00

After nearly a year, it was time to revisit the last blog entry. The source code of the example implementation was still on one of my hard drives and needed to be cleaned up and released, which I had planned for the first quarter of last year.
I also received a comment highlighting a few mistakes I made in the previous blog post, and on top of that I wanted to add numbers for other GPUs as well.

While looking at the code, the few hours I had reserved for the task turned into a day and then a bit more. On top of that, getting some time off from my project management duties at Confetti was quite enjoyable :-)

In the previous blog post I forgot to mention that I used Intel's GPA to measure all the performance numbers. Several runs of the performance profiler always generated slightly different results, but I felt the overall direction was becoming clear.
My current setup uses AMD driver 14.12, the latest at the time of writing.

All the source code can be found at

https://code.google.com/p/graphicsdemoskeleton/source/browse/#svn%2Ftrunk%2F04_DirectCompute%20Parallel%20Reduction%20Case%20Study

Comparing the current performance numbers with the setup from the previous post, it becomes obvious that not much has changed for the first three rows. Here is the new chart:

Latest Performance numbers from January 2015

In the fourth column ("Pre-fetching two color values into TGSM with 64 threads"), the numbers for the 6770 are nearly cut in half while they stay roughly the same for the other cards; only a slight improvement on the 290X. This is the first shader that fetches two values from device memory, converts them to luminance, stores them into shared memory and then kicks off the Parallel Reduction.
Here is the source code.

StructuredBuffer<float4> Input : register( t0 ); // element type assumed to be float4 (it is dotted with a float4 below)
RWTexture2D<float> Result : register (u0);       // element type assumed to be float (one luminance value per thread group)

#define THREADX 8
#define THREADY 16

cbuffer cbCS : register(b0)
{
    int c_height : packoffset(c0.x);
    int c_width : packoffset(c0.y); // size view port
    /* This is in the constant buffer as well but not used in this shader, so I just keep it in here as a comment
    float c_epsilon : packoffset(c0.z); // julia detail
    int c_selfShadow : packoffset(c0.w); // selfshadowing on or off
    float4 c_diffuse : packoffset(c1); // diffuse shading color
    float4 c_mu : packoffset(c2); // julia quaternion parameter
    float4x4 rotation : packoffset(c3);
    float zoom : packoffset(c7.x);
    */
};

// the following shader applies parallel reduction to an image and converts it to luminance
#define groupthreads (THREADX * THREADY)
groupshared float sharedMem[groupthreads];

[numthreads(THREADX, THREADY, 1)]
void PostFX( uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID, uint GI : SV_GroupIndex )
{
    const float4 LumVector = float4(0.2125f, 0.7154f, 0.0721f, 0.0f);

    // thread groups in x is 1920 / 16 = 120
    // thread groups in y is 1080 / 16 = 68 (rounded up)
    // index in x (1920) goes from 0 to 119 | 120 (thread groups) * 8 (threads) = 960 indices in x
    // index in y (1080) goes from 0 to 67 | 68 (thread groups) * 16 (threads) = 1088 indices in y
    uint idx = ((DTid.x * 2) + DTid.y * c_width);

    // 1920 * 1080 = 2073600 pixels
    // 120 * 68 * 128 (number of threads : 8 * 16) * 2 (number of fetches) = 2088960
    float temp = (dot(Input[idx], LumVector) + dot(Input[idx + 1], LumVector));
    sharedMem[GI] = temp;

    // wait until everything is transferred from device memory to shared memory
    GroupMemoryBarrierWithGroupSync();

    // hard-coded for 128 threads
    if (GI < 64)
        sharedMem[GI] += sharedMem[GI + 64];
    GroupMemoryBarrierWithGroupSync();

    // from here on only one wavefront is active, so no further barriers are used
    if (GI < 32) sharedMem[GI] += sharedMem[GI + 32];
    if (GI < 16) sharedMem[GI] += sharedMem[GI + 16];
    if (GI < 8) sharedMem[GI] += sharedMem[GI + 8];
    if (GI < 4) sharedMem[GI] += sharedMem[GI + 4];
    if (GI < 2) sharedMem[GI] += sharedMem[GI + 2];
    if (GI < 1) sharedMem[GI] += sharedMem[GI + 1];

    // Have the first thread write out to the output
    if (GI == 0)
    {
        // write out the result for each thread group
        Result[Gid.xy] = sharedMem[0] / (THREADX * THREADY * 2);
    }
}

The grid size in x and y is 1920 / 16 = 120 and 1080 / 16 = 67.5, rounded up to 68. In other words, this is the number of thread groups kicked off by the dispatch call.
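The numbers in the shader comments can be double-checked on the CPU. The following is a small Python sketch, not part of the original sample; the helper names `dispatch_dims` and `fetch_count` are made up for illustration. It assumes each thread group covers a 16 x 16 pixel tile (8 threads times 2 fetches in x, 16 threads in y).

```python
def dispatch_dims(width, height, tile_x, tile_y):
    """Thread groups per axis, rounded up so the whole image is covered."""
    return (width + tile_x - 1) // tile_x, (height + tile_y - 1) // tile_y


def fetch_count(groups_x, groups_y, threads_per_group, fetches_per_thread):
    """Total number of reads issued from device memory by one dispatch."""
    return groups_x * groups_y * threads_per_group * fetches_per_thread


groups_x, groups_y = dispatch_dims(1920, 1080, 16, 16)
total = fetch_count(groups_x, groups_y, 8 * 16, 2)

print(groups_x, groups_y)        # grid size of the dispatch
print(total)                     # fetches issued
print(total - 1920 * 1080)       # fetches that fall outside the image
```

Because 1080 is not a multiple of 16, the last row of thread groups reaches 8 rows past the bottom of the image, which is why the fetch count in the comment is slightly larger than the pixel count.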

The next shader extends the idea and fetches four instead of two values from device memory.

// thread groups in x is 1920 / 16 = 120
// thread groups in y is 1080 / 16 = 68 (rounded up)
// index in x (1920) goes from 0 to 119 | 120 (thread groups) * 4 (threads) = 480 indices in x
// index in y (1080) goes from 0 to 67 | 68 (thread groups) * 16 (threads) = 1088 indices in y
uint idx = ((DTid.x * 4) + DTid.y * c_width);

// 1920 * 1080 = 2073600 pixels
// 120 * 68 * 64 (number of threads : 4 * 16) * 4 (number of fetches) = 2088960
float temp = (dot(Input[idx], LumVector) + dot(Input[idx + 1], LumVector))
           + (dot(Input[idx + 2], LumVector) + dot(Input[idx + 3], LumVector));

// store in shared memory
sharedMem[IndexOfThreadInGroup] = temp;

// wait until everything is transferred from device memory to shared memory
GroupMemoryBarrierWithGroupSync();

if (IndexOfThreadInGroup < 32) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 32];
if (IndexOfThreadInGroup < 16) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 16];
if (IndexOfThreadInGroup < 8) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 8];
if (IndexOfThreadInGroup < 4) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 4];
if (IndexOfThreadInGroup < 2) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 2];
if (IndexOfThreadInGroup < 1) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 1];

Looking at the performance results ("Pre-fetching four color values into TGSM with 64 threads"), the difference between the performance numbers is not significant. This seems to be the first sign that the shader might be limited by memory read bandwidth: just reading the 1080p memory area takes the longest time.

While all the previous shaders were writing the reduced image into a 120 x 68 area, the following two shaders in the chart are writing into a 60 x 34 area. This is mostly achieved by decreasing the grid size, or in other words running fewer thread groups. To make up for the decrease in grid size we had to increase the size of each thread group to 256 and then 512.

#define THREADX 8
#define THREADY 32

... // more code here

// thread groups in x is 1920 / 32 = 60
// thread groups in y is 1080 / 32 = 34
// index in x (1920) goes from 0 to 60 (thread groups) * 8 (threads) = 480 indices in x
// index in y (1080) goes from 0 to 34 (thread groups) * 32 (threads) = 1088 indices in y
uint idx = ((DTid.x * 4) + DTid.y * c_width);

// 1920 * 1080 = 2073600 pixels
// 60 * 34 * 256 (number of threads : 8 * 32) * 4 (number of fetches) = 2088960
float temp = (dot(Input[idx], LumVector) + dot(Input[idx + 1], LumVector))
+ (dot(Input[idx + 2], LumVector) + dot(Input[idx + 3], LumVector));

// store in shared memory
sharedMem[IndexOfThreadInGroup] = temp;

// wait until everything is transfered from device memory to shared memory
GroupMemoryBarrierWithGroupSync();

// hard-coded for 256 threads
if (IndexOfThreadInGroup < 128)
sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 128];
GroupMemoryBarrierWithGroupSync();

if (IndexOfThreadInGroup < 64)
sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 64];
GroupMemoryBarrierWithGroupSync();

if (IndexOfThreadInGroup < 32) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 32];
if (IndexOfThreadInGroup < 16) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 16];
if (IndexOfThreadInGroup < 8) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 8];
if (IndexOfThreadInGroup < 4) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 4];
if (IndexOfThreadInGroup < 2) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 2];
if (IndexOfThreadInGroup < 1) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 1];

... // more code here

The next shader decreases the grid size even more and increases the number of threads of each thread group to 1024, the current maximum that the Direct3D run-time allows. For both shaders ("Pre-fetching four color values into TGSM with 1024 threads" and then "Pre-fetching four color values into 2x TGSM with 1024 threads"), the performance numbers do not change much compared to the previous shaders, although the reduction has to do more work because the dimensions of the target area are halved in each direction. Here is the source code for the second of the two shaders that fetch four color values with 1024 threads per thread group:

#define THREADX 16
#define THREADY 64
//.. constant buffer code here
//
// the following shader applies parallel reduction to an image and converts it to luminance
//
#define groupthreads THREADX * THREADY
groupshared float sharedMem[groupthreads * 2]; // double the number of shared mem slots

[numthreads(THREADX, THREADY, 1)]
void PostFX( uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID, uint GI : SV_GroupIndex )
{
const float4 LumVector = float4(0.2125f, 0.7154f, 0.0721f, 0.0f);

// thread groups in x is 1920 / 64 = 30
// thread groups in y is 1080 / 64 = 17
// index in x (1920) goes from 0 to 29 | 30 (thread groups) * 16 (threads) = 480 indices in x
// index in y (1080) goes from 0 to 16 | 17 (thread groups) * 64 (threads) = 1088 indices in y
uint idx = ((DTid.x * 4) + DTid.y * c_width); // index into structured buffer

// 1920 * 1080 = 2073600 pixels
// 30 * 17 * 1024 (number of threads : 16 * 64) * 4 (number of fetches) = 2088960
uint idSharedMem = GI * 2;
sharedMem[idSharedMem] = (dot(Input[idx], LumVector) + dot(Input[idx + 1], LumVector));
sharedMem[idSharedMem + 1] = (dot(Input[idx + 2], LumVector) + dot(Input[idx + 3], LumVector));

// wait until everything is transfered from device memory to shared memory
GroupMemoryBarrierWithGroupSync();

// hard-coded for 1024 threads
if (GI < 1024)
sharedMem[GI] += sharedMem[GI + 1024];
GroupMemoryBarrierWithGroupSync();

if (GI < 512)
sharedMem[GI] += sharedMem[GI + 512];
GroupMemoryBarrierWithGroupSync();

if (GI < 256)
sharedMem[GI] += sharedMem[GI + 256];
GroupMemoryBarrierWithGroupSync();

if (GI < 128)
sharedMem[GI] += sharedMem[GI + 128];
GroupMemoryBarrierWithGroupSync();

if (GI < 64)
sharedMem[GI] += sharedMem[GI + 64];
GroupMemoryBarrierWithGroupSync();

if (GI < 32) sharedMem[GI] += sharedMem[GI + 32];
if (GI < 16) sharedMem[GI] += sharedMem[GI + 16];
if (GI < 8) sharedMem[GI] += sharedMem[GI + 8];
if (GI < 4) sharedMem[GI] += sharedMem[GI + 4];
if (GI < 2) sharedMem[GI] += sharedMem[GI + 2];
if (GI < 1) sharedMem[GI] += sharedMem[GI + 1];

One thing I wanted to try here was to use double the amount of shared memory and therefore saturate the 1024 threads more by having the first addition happen in shared memory. In the end that didn't change much: the shader is not utilizing temp registers much, so replacing a temp register with shared memory didn't increase performance much.

My last test was aiming at fetching 16 color values while decreasing the 1080p image to 15x9. The result is shown in the last column. This shader also uses 1024 threads and fetches into 2x the shared memory like the previous one. It runs slower than the previous shaders. Here is the source code:

#define THREADX 16
#define THREADY 64
//.. some constant buffer code here
//
// the following shader applies parallel reduction to an image and converts it to luminance
//
#define groupthreads THREADX * THREADY
groupshared float sharedMem[groupthreads * 2]; // double the number of shared mem slots

[numthreads(THREADX, THREADY, 1)]
void PostFX( uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID, uint GI : SV_GroupIndex )
{
const float4 LumVector = float4(0.2125f, 0.7154f, 0.0721f, 0.0f);

// thread groups in x is 1920 / 128 = 15
// thread groups in y is 1080 / 128 = 9
// index in x (1920) goes from 0 to 14 | 15 (thread groups) * 16 (threads)
// = 240 indices in x | need to fetch 8 in x direction
// index in y (1080) goes from 0 to 8 | 9 (thread groups) * 64 (threads)
// = 576 indices in y | need to fetch 2 in y direction
uint idx = ((DTid.x * 8) + (DTid.y * 2) * c_width); // index into structured buffer

// 1920 * 1080 = 2073600 pixels
// 15 * 9 * 1024 (number of threads : 16 * 64) * 16 (number of fetches) = 2211840
uint idSharedMem = GI * 2;
sharedMem[idSharedMem] = (dot(Input[idx], LumVector)
+ dot(Input[idx + 1], LumVector)
+ dot(Input[idx + 2], LumVector)
+ dot(Input[idx + 3], LumVector)
+ dot(Input[idx + 4], LumVector)
+ dot(Input[idx + 5], LumVector)
+ dot(Input[idx + 6], LumVector)
+ dot(Input[idx + 7], LumVector));
sharedMem[idSharedMem + 1] = (dot(Input[idx + 8], LumVector)
+ dot(Input[idx + 9], LumVector)
+ dot(Input[idx + 10], LumVector)
+ dot(Input[idx + 11], LumVector)
+ dot(Input[idx + 12], LumVector)
+ dot(Input[idx + 13], LumVector)
+ dot(Input[idx + 14], LumVector)
+ dot(Input[idx + 15], LumVector));

// wait until everything is transfered from device memory to shared memory
GroupMemoryBarrierWithGroupSync();

// hard-coded for 1024 threads
if (GI < 1024)
sharedMem[GI] += sharedMem[GI + 1024];
GroupMemoryBarrierWithGroupSync();

if (GI < 512)
sharedMem[GI] += sharedMem[GI + 512];
GroupMemoryBarrierWithGroupSync();

if (GI < 256)
sharedMem[GI] += sharedMem[GI + 256];
GroupMemoryBarrierWithGroupSync();

if (GI < 128)
sharedMem[GI] += sharedMem[GI + 128];
GroupMemoryBarrierWithGroupSync();

if (GI < 64)
sharedMem[GI] += sharedMem[GI + 64];
GroupMemoryBarrierWithGroupSync();

if (GI < 32) sharedMem[GI] += sharedMem[GI + 32];
if (GI < 16) sharedMem[GI] += sharedMem[GI + 16];
if (GI < 8) sharedMem[GI] += sharedMem[GI + 8];
if (GI < 4) sharedMem[GI] += sharedMem[GI + 4];
if (GI < 2) sharedMem[GI] += sharedMem[GI + 2];
if (GI < 1) sharedMem[GI] += sharedMem[GI + 1];

Looking at all those numbers, it seems that performance is mostly limited by how fast the 1080p source buffer can be read. At the moment I would predict that reducing the source resolution to 720p or 480p would lead to a more differentiated view of performance. Maybe something to try in the future ...
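To get a feeling for why reading the source buffer could dominate, here is a back-of-the-envelope sketch in Python. It is only an estimate under stated assumptions: the input is assumed to hold one 16-byte float4 per pixel, and the bandwidth value of 100 GB/s is a placeholder, not a measured or vendor-quoted number for any of the cards in the chart. The function name `min_read_time_ms` is made up for this example.

```python
def min_read_time_ms(width, height, bytes_per_pixel, bandwidth_gb_per_s):
    """Lower bound for reading every pixel once at the given bandwidth."""
    bytes_read = width * height * bytes_per_pixel
    seconds = bytes_read / (bandwidth_gb_per_s * 1e9)
    return seconds * 1e3


ASSUMED_BANDWIDTH_GB_S = 100.0   # placeholder value, replace with your GPU's
BYTES_PER_PIXEL = 16             # assumed: one float4 per pixel

for name, (w, h) in {"1080p": (1920, 1080),
                     "720p": (1280, 720),
                     "480p": (640, 480)}.items():
    ms = min_read_time_ms(w, h, BYTES_PER_PIXEL, ASSUMED_BANDWIDTH_GB_S)
    print(f"{name}: at least {ms:.3f} ms just to read the source")
```

If the measured time of a shader is close to this lower bound, changing the reduction itself cannot help much, which would match the flat numbers in the chart. At smaller resolutions the read cost shrinks and the differences between the reduction variants should become easier to see.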

Compute Shader Optimizations for AMD GPUs: Parallel Reduction

2014-03-26T14:19:00.002-07:00

We recently looked more often into compute shader optimizations on AMD platforms. Additionally I had a UCSD class in Winter that dealt with this topic and a talk at the Sony booth at GDC 2014 that covered the same topic.
This blog post covers a common scenario while implementing a post-processing pipeline: Parallel Reduction. It uses the excellent talk given by Mark Harris a few years back as a starting point, enriched with new discoveries that come from the new hardware platforms and AMD specifics.

The topics covered are:

- Sequential Shared Memory (TGSM) Access: utilizing the Memory bank layout
- When to Unroll Loops in a compute shader
    - Overhead of address arithmetic and loop instructions
    - Skipping Memory Barriers: Wavefront
- Pre-fetching data into Shared Memory

Most of the examples accompanying this blog post show a simple parallel reduction going from 1080p to 120x68. While reducing the size of the image, these examples also reduce the color value to luminance.

Image 1 - Going from 1080p to 120x68 and from Color to Luminance

On an algorithmic level, Parallel Reduction looks like a tree-based approach:

Image 2 - Tree-based approach for Parallel Reduction

Instead of building a fully recursive kernel, which is not possible on current hardware, the algorithm mimics recursion by using a for loop.
As you will see later on, the fact that each iteration uses fewer threads from the pool of threads in a thread group has some impact on performance. Let's say we allocate 256 threads in a thread group: only the first iteration of the Parallel Reduction algorithm will use all of them. The second iteration, depending on the implementation, might only use half of them, the next one half of those, and so on.
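The shrinking number of active threads is easy to see in a CPU model of the loop. The sketch below is plain Python written for this explanation, not the shader itself; `tree_reduce` is a hypothetical helper, and the inner `for` loop stands in for the threads of one thread group that pass the `GI < s` test.

```python
def tree_reduce(values):
    """Sum a power-of-two sized list the way one thread group would.

    Returns the sum and the number of active threads in each iteration.
    """
    shared = list(values)            # stands in for thread group shared memory
    active_per_step = []
    s = len(shared) // 2
    while s > 0:
        for gi in range(s):          # only threads with GI < s do any work
            shared[gi] += shared[gi + s]
        active_per_step.append(s)
        s //= 2                      # half as many threads next time
    return shared[0], active_per_step


total, active = tree_reduce(range(256))
print(total)    # same result as sum(range(256))
print(active)   # how many of the 256 threads were busy per iteration
```

With 256 values, all 256 threads are only busy while loading into shared memory; the additions then use 128, 64, 32 threads and so on, down to a single thread for the last step.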

TGSM Access: Utilizing the Memory bank layout

One of the first rules of thumb mentioned by Nicolas Thibieroz deals with the access pattern that is used to access TGSM. There is only a limited number of I/O banks and they need to be utilized in the most efficient way. It turns out that both AMD and NVIDIA seem to have 32 banks.

Image 3 - Memory banks are arranged linearly with addresses

Accessing TGSM with addresses that are 32 DWORDs apart will lead to a situation where threads use the same bank. This generates so-called bank conflicts. In other words, several threads accessing different addresses that map to the same bank at the same time creates bank conflicts.
The preferred method to access TGSM is to have 32 threads use 32 different banks. Usually the extreme example mentioned is the 2D array example, where you want to access memory by increasing the bank number first -you might consider this moving horizontally- and then increase the vertical direction. This way threads will hit different banks more often.
The more subtle bank conflicts happen when memory banks are accessed in non-sequential patterns. Mark Harris has shown the following example. Here is an image depicting this:

Image 4 - Memory banks accessed interleaved

The first example source is showing an implementation of this memory access pattern:

// Example for Interleaved Memory access
[numthreads(THREADX, THREADY, 1)]
void PostFX( uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID, uint GI : SV_GroupIndex )
{
    const float4 LumVector = float4(0.2125f, 0.7154f, 0.0721f, 0.0f);
    uint idx = DTid.x + DTid.y * c_width; // read from structured buffer
    sharedMem[GI] = dot(Input[idx], LumVector); // store in shared memory
    GroupMemoryBarrierWithGroupSync(); // wait until everything is transferred from device memory to shared memory

    [unroll(groupthreads)]
    for (uint s = 1; s < groupthreads; s *= 2) // stride: 1, 2, 4, 8, 16, 32, 64, 128
    {
        int index = 2 * s * GI;
        if (index < (groupthreads))
            sharedMem[index] += sharedMem[index + s];
        GroupMemoryBarrierWithGroupSync();
    }

    // Have the first thread write out to the output
    if (GI == 0)
    {
        // write out the result for each thread group
        Result[Gid.xy] = sharedMem[0] / (THREADX * THREADY);
    }
}

This code fetches TGSM in its for loop in a pattern as shown in Image 4. The image of a sequential access pattern is supposed to look like this:

Image 5 - Memory banks accessed sequential

The source code of the sequential access version looks like this:

[numthreads(THREADX, THREADY, 1)]
void PostFX( uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID, uint GI : SV_GroupIndex )
{
    const float4 LumVector = float4(0.2125f, 0.7154f, 0.0721f, 0.0f);
    uint idx = DTid.x + DTid.y * c_width; // read from structured buffer
    sharedMem[GI] = dot(Input[idx], LumVector); // store in shared memory
    GroupMemoryBarrierWithGroupSync(); // wait until everything is transferred from device memory to shared memory

    [unroll(groupthreads / 2)]
    for (uint s = groupthreads / 2; s > 0; s >>= 1) // stride: 128, 64, 32, ..., 1 for 256 threads
    {
        if (GI < s)
            sharedMem[GI] += sharedMem[GI + s];
        GroupMemoryBarrierWithGroupSync();
    }

    // Have the first thread write out to the output
    if (GI == 0)
    {
        // write out the result for each thread group
        Result[Gid.xy] = sharedMem[0] / (THREADX * THREADY);
    }
}

The changes are confined to the for loop: the stride now starts at half the group size and is halved in each iteration, and each thread indexes shared memory directly with GI. While on previous hardware generations this slight change in source code had some impact on performance, on modern AMD GPUs it doesn't seem to make a difference anymore. All the measurements were done on an AMD RADEON(TM) HD 6770, an AMD RADEON(TM) HD 7750 and an AMD RADEON(TM) HD 7850:

Image 5 - Performance of Interleaved / Sequential TGSM access pattern

In case of our example program re-arranging the access pattern doesn't make a difference. It might be that the driver re-arranges the code already or the hardware re-directs the accesses.
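To make the difference between the two access patterns concrete, here is a toy bank model in Python. It is a simplification made up for this explanation: it assumes 32 banks, one DWORD per bank entry, and 32 threads issuing their accesses together. Real hardware, and as noted possibly the driver, may behave differently, so this only shows what the rule of thumb predicts, not what was measured.

```python
from collections import defaultdict

NUM_BANKS = 32   # assumption taken from the text above
LANES = 32       # assumption: 32 threads issue their accesses together


def conflict_degree(addresses):
    """Worst-case number of distinct DWORD addresses landing in one bank."""
    per_bank = defaultdict(set)
    for address in addresses:
        per_bank[address % NUM_BANKS].add(address)
    return max(len(hits) for hits in per_bank.values())


def interleaved_degrees(group_size, strides):
    """Conflict degree per stride for the pattern index = 2 * s * GI."""
    result = []
    for s in strides:
        addresses = [2 * s * gi for gi in range(LANES) if 2 * s * gi < group_size]
        result.append(conflict_degree(addresses))
    return result


def sequential_degree():
    """The sequential pattern lets thread GI touch address GI."""
    return conflict_degree(range(LANES))


print(interleaved_degrees(256, (1, 2, 4, 8, 16)))
print(sequential_degree())
```

In this model the interleaved pattern serializes more and more accesses as the stride grows, while the sequential pattern always hits 32 different banks. That is the theoretical motivation for the rewrite, even though the measurements above show no difference on the tested cards.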

Unroll the Loops
A likely overhead of the shaders shown above is the instruction overhead of ancillary instructions that are not loads, stores, or arithmetic instructions for core computations. In other words address arithmetic and loop instructions overhead.
Thread groups that access Thread Group Shared Memory are automatically broken down into hardware schedulable groups of threads. In case of NVIDIA, those are called Warps and there are 32 threads in a warp; in case of AMD they are called Wavefronts and there are 64 threads in a wavefront (there is a finer level of granularity regarding Wavefronts that we won't cover here). Instructions are SIMD synchronous within a Warp or Wavefront. That means as long as the number of threads executed is at most 32 on NVIDIA or at most 64 on AMD, a memory barrier is not necessary.
In case of the tree-like algorithm that is used for Parallel Reduction as shown in Image 2, the number of threads utilized in a loop is decreasing. As soon as it drops to 32 or 64, respectively, a memory barrier shouldn't be necessary anymore.
This means that unrolling loops not only might save some ancillary instructions but also might reduce the number of memory barriers used in a compute shader. Source code for an unrolled loop might look like this:

… // like the previous shader
if (groupthreads >= 256)
{
    if (GI < 128)
        sharedMem[GI] += sharedMem[GI + 128];
    GroupMemoryBarrierWithGroupSync();
}

// AMD - 64 / NVIDIA - 32
if (GI < 64)
{
    // fold elements 64..127 in first, otherwise half of the partial sums are dropped
    if (groupthreads >= 128) sharedMem[GI] += sharedMem[GI + 64];
    if (groupthreads >= 64) sharedMem[GI] += sharedMem[GI + 32];
    if (groupthreads >= 32) sharedMem[GI] += sharedMem[GI + 16];
    if (groupthreads >= 16) sharedMem[GI] += sharedMem[GI + 8];
    if (groupthreads >= 8) sharedMem[GI] += sharedMem[GI + 4];
    if (groupthreads >= 4) sharedMem[GI] += sharedMem[GI + 2];
    if (groupthreads >= 2) sharedMem[GI] += sharedMem[GI + 1];
}
…

The performance numbers for this optimization show that older hardware appreciates the effort of unrolling the loop and decreasing the number of memory barriers more than newer designs:

Image 6 - Unrolled Loops / Less Memory Barriers performance Impact
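How many barriers the unrolling can save depends on the group size and on the width of a Warp or Wavefront. The following Python sketch only counts barriers under the lockstep assumption described above; treat that assumption as hardware-specific behaviour rather than something the shader language promises. Both function names are hypothetical.

```python
def barriers_in_loop(group_size):
    """Looped version: one barrier after the load, one per loop iteration."""
    steps = group_size.bit_length() - 1      # log2 for power-of-two sizes
    return 1 + steps


def barriers_unrolled(group_size, wave_size):
    """Unrolled version: barriers only while more than one wave is writing."""
    count = 1 if group_size > wave_size else 0   # barrier after the initial load
    s = group_size // 2                          # threads active in the first add
    while s > wave_size:                         # writers span several waves
        count += 1
        s //= 2
    return count


for wave_size, vendor in ((64, "AMD"), (32, "NVIDIA")):
    print(vendor, barriers_in_loop(256), barriers_unrolled(256, wave_size))
```

For 256 threads the loop issues nine barriers, while the unrolled version in this model gets away with two on a 64-wide wavefront and three on a 32-wide warp. A thread group that fits into a single wavefront needs none at all in this model.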

Pre-fetching two Color Values into Shared Memory
When looking at the previous shader, the only operation that utilizes all 256 threads in the thread group is the first load into shared memory. By fetching two color values from device memory and adding them already at the beginning of the shader, we could utilize the threads better.
To stay consistent with the previous Parallel Reduction shaders and offering the same 1080p to 120x68 reduction, the following shader only uses 64 threads in the thread group.

…
// pack two values
// like the previous shader
// store in shared memory
float temp = (dot(Input[idx * 2], LumVector) + dot(Input[idx * 2 + 1], LumVector));
sharedMem[GI] = temp;

// AMD - 64 / NVIDIA - 32
if (GI < 32)
{
    // fold elements 32..63 in first, otherwise half of the partial sums are dropped
    if (groupthreads >= 64) sharedMem[GI] += sharedMem[GI + 32];
    if (groupthreads >= 32) sharedMem[GI] += sharedMem[GI + 16];
    if (groupthreads >= 16) sharedMem[GI] += sharedMem[GI + 8];
    if (groupthreads >= 8) sharedMem[GI] += sharedMem[GI + 4];
    if (groupthreads >= 4) sharedMem[GI] += sharedMem[GI + 2];
    if (groupthreads >= 2) sharedMem[GI] += sharedMem[GI + 1];
}
…

Because the number of threads used is 64, which fits into a single AMD wavefront, memory barriers are not necessary.

Image 7 - Pre-fetching two color values

It seems that throughout the hardware generations, performance benefits from fetching two values at the same time, even though the number of threads per thread group was reduced from 256 to 64. The reduced number of threads will become a topic later on.

Pre-fetching four Color Values into Shared Memory
With the success story behind fetching two color values into TGSM, the obvious question arises: what would happen if four values were fetched? To keep the Parallel Reduction algorithm comparable, so that it reduces from 1080p to 120x68, the number of threads in the thread group is reduced again.
The following shader only uses 16 threads per thread group and is therefore considered not efficient in this respect. The official rule of thumb is to use a multiple of 64. On the bright side, it doesn't use any memory barriers.

…
// pack four values
#define THREADX 4
#define THREADY 4
…
// like the previous shader
float temp = (dot(Input[idx * 4], LumVector) + dot(Input[idx * 4 + 1], LumVector))
           + (dot(Input[idx * 4 + 2], LumVector) + dot(Input[idx * 4 + 3], LumVector));

// store in shared memory -> no group barrier
sharedMem[IndexOfThreadInGroup] = temp;

// AMD - 64 / NVIDIA - 32
if (IndexOfThreadInGroup < 16)
{
    if (groupthreads >= 16) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 8];
    if (groupthreads >= 8) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 4];
    if (groupthreads >= 4) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 2];
    if (groupthreads >= 2) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 1];
}
…

Compared to the previous shader, performance increases nearly linearly:

Image 8 - Pre-fetching four color values

The improvement from fetching four instead of two color values brings up the next question: how would performance change if the number of threads in a thread group were increased and the number of thread groups in the dispatch decreased? This also leads to a higher parallel reduction because the resulting image is smaller.
The next example increases the number of threads from 16 to 64:

…
// pack four values
#define THREADX 8
#define THREADY 8
…
// like the previous shader
float temp = (dot(Input[idx * 4], LumVector) + dot(Input[idx * 4 + 1], LumVector))
           + (dot(Input[idx * 4 + 2], LumVector) + dot(Input[idx * 4 + 3], LumVector));

// store in shared memory -> no group barrier
sharedMem[IndexOfThreadInGroup] = temp;

// AMD - 64 / NVIDIA - 32
if (IndexOfThreadInGroup < 64)
{
    if (groupthreads >= 64) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 32];
    if (groupthreads >= 32) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 16];
    if (groupthreads >= 16) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 8];
    if (groupthreads >= 8) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 4];
    if (groupthreads >= 4) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 2];
    if (groupthreads >= 2) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 1];
}
…

Similar to the previous shader this shader avoids any memory barriers but it runs with 64 instead of 16 threads and it is not executed as often, because the grid size was reduced to 60 x 34.

Image 9 - Increasing the number of Threads from 16 to 64 and decreasing the size of the result

Although the number of threads is increased, the workload of this shader is also increased because the size of the resulting image is halved in each direction. In other words, this shader does more work than the previous shaders, and the numbers still allow the conclusion that it runs faster than the previous one.

Following the successful path of increasing the number of threads, the last shader in this blog post will use 256 threads to parallel reduce the image size from 1080p to 30x17.

... // like the previous shaders
// store in shared memory
sharedMem[IndexOfThreadInGroup] = temp;

// wait until everything is transferred from device memory to shared memory
GroupMemoryBarrierWithGroupSync();

if (groupthreads >= 256)
{
    if (IndexOfThreadInGroup < 128)
        sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 128];
    GroupMemoryBarrierWithGroupSync();
}

// AMD - 64 / NVIDIA - 32
if (IndexOfThreadInGroup < 64)
{
    // fold elements 64..127 in first, otherwise half of the partial sums are dropped
    if (groupthreads >= 128) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 64];
    if (groupthreads >= 64) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 32];
    if (groupthreads >= 32) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 16];
    if (groupthreads >= 16) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 8];
    if (groupthreads >= 8) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 4];
    if (groupthreads >= 4) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 2];
    if (groupthreads >= 2) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 1];
}
...

With the increased number of threads we have to add memory barriers again. Nevertheless this shader runs quicker than all the previous shaders while -at the same time- doing more work:

Image 10 - Increasing the number of Threads from 64 to 256 and decreasing the size of the result

Please note how the older GPU starts beating the newer GPU when the number of threads is increased. Overall for the 6770, we went from roughly 1 ms to close to a tenth of the original time. For the 7750 and the 7850 we ended up reducing the time to a bit more than a fourth, while increasing the workload in the last two test setups.

Conclusion

Like with most optimization tasks, there is always more to consider and more to try out. The list of things that would be worth considering is still short, but given some time it will grow.

If you -the valued reader of this blog- have anything you want me to try and add to this list, please let me know and I will add it to this blog post.

Overall I believe the case studies shown above should give someone a good starting point to optimize the Parallel Reduction part of a post-processing pipeline.

One other topic crucial for the performance of a post-processing pipeline is the speed of the blur kernel. Optimizations that lead to the "ultimate" blur kernel will have to wait for a future blog post :-)

Thanks to Stephan Hodes from AMD for providing feedback.

DirectX 12 Blog

2014-03-24T15:06:00.001-07:00

Finally information about DirectX 12 is published on Matt Sandy's blog.

Today I wear my DirectX 12 T-Shirt to work ... below this shirt I am wearing the Mantle T-Shirt (... I was thinking about the order for a while but only this order can make sense ... right?).
I had the opportunity to test drive DirectX 12 in the last couple of months and it looks already great. Very excited to work with DirectX 12 and Mantle in the near future.

GDC 2014 - Compute Shader Optimizations

2014-03-13T06:19:00.000-07:00

I will be speaking at the Sony booth on Wednesday at 5pm on compute shader optimizations. The 15 minute talk will be broadcast on Twitch.
The talk will cover performance numbers of three different AMD GPUs: RADEON 6770, RADEON 7750 and RADEON 7850.
The main topics are:

- Sequential Shared Memory (TGSM) Access: utilizing the Memory bank layout
- When to Unroll Loops in a compute shader
    - Overhead of address arithmetic and loop instructions
    - Skipping Memory Barriers: Wavefront
- Pre-fetching data into Shared Memory
- Packing data into Shared Memory

Looking at two different generations of AMD GPUs makes it easier to see which of the ground rules developed for GPU optimizations still work on current GPUs, compared to previous generations.

This is based on some of the optimization work we did on AAA games last year.

At Confetti we have Aura (our Dynamic Global Illumination System) and PixelPuzzle (our PostFX pipeline) running in compute.

This talk will deal with how to optimize parts of a PostFX pipeline with Compute. I am also planning to write a blog series about this.

Link Collection

2014-01-14T14:11:00.000-08:00

The book "Is Parallel Programming Hard, And, If So, What Can You Do About It?" can be found at

https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html

An overview on C99 support in Visual Studio 2013 can be found at

http://blogs.msdn.com/b/vcblog/archive/2013/07/19/c99-library-support-in-visual-studio-2013.aspx

Adaptive Depth Bias for Shadow Maps

http://jcgt.org/published/0003/04/08/paper-lowres.pdf

FSE decoding : how it works

http://fastcompression.blogspot.com/2014/01/fse-decoding-how-it-works.html?spref=tw

Learning Three.js - WebGL for Dummies

http://learningthreejs.com/

Fuzebox

https://www.fuzebox.com

Blending of Normal Maps: Blending in Detail

http://blog.selfshadow.com/publications/blending-in-detail/

What Every C Programmer Should Know About Undefined Behavior #1/3

http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html?repost

CUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model

http://nvlabs.github.io/cub/

Visual Studio 2013 - C99 support

2014-01-10T12:14:00.000-08:00

I think using C99 in game development could be useful for large teams, especially if they are distributed over several locations.

So I thought I would take a closer look at the support for C99 in Visual Studio 2013 (we also use VS 2013 with C99 now in my UCSD class).

The new features that are supported in VS 2013 are:

New features in 2013

- variable declarations mixed with code (no longer only at the start of a block)

- _Bool

- compound literals

- designated initializers
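Here is a minimal sketch that uses all four of the newly supported features in one place. It is written against standard C99; the type and function names (Vec3, LightDesc, shadow_light_intensity, make_key_light) are made up for this illustration, and whether a particular Visual Studio version accepts every line has not been verified here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not taken from any real code base. */
typedef struct Vec3 { float x, y, z; } Vec3;

typedef struct LightDesc {
    Vec3  position;
    float intensity;
    _Bool castsShadows;                 /* C99 boolean type */
} LightDesc;

/* Sums the intensity of all shadow-casting lights. */
float shadow_light_intensity(const LightDesc *lights, size_t count)
{
    assert(lights != NULL || count == 0);

    float sum = 0.0f;                   /* declaration after a statement */
    for (size_t i = 0; i < count; ++i) {
        if (lights[i].castsShadows)
            sum += lights[i].intensity;
    }
    return sum;
}

/* Builds a light; members not named in the initializer are zeroed. */
LightDesc make_key_light(void)
{
    LightDesc light = { .intensity = 2.5f, .castsShadows = 1 };  /* designated initializer */
    light.position = (Vec3){ 0.0f, 10.0f, 0.0f };                /* compound literal */
    return light;
}
```

Designated initializers are the feature I would expect to pay off most in a large team, because the initializer keeps working when somebody reorders or adds struct members.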

Already available:

variadic macros, long long, __pragma, __FUNCTION__, and __restrict

What is missing:

- variable-length arrays (VLAs)

- Reserved keywords in C99

C99 has a few reserved keywords that are not recognized by C++:

restrict

_Bool -> this is now implemented ... see above

_Complex

_Imaginary

_Pragma

- restrict keyword

C99 supports the restrict keyword, which allows for certain optimizations involving pointers. For example:

void copy(int *restrict d, const int *restrict s, int n)
{
    while (n-- > 0)
        *d++ = *s++;
}

C++ does not recognize this keyword.

A simple work-around for code that is meant to be compiled as either C or C++ is to use a macro for the restrict keyword:

#ifdef __cplusplus
#define restrict /* nothing */
#endif

(This feature is likely to be provided as an extension by many C++ compilers. If it is, it is also likely to be allowed as a reference modifier as well as a pointer modifier.)
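As an alternative to defining the keyword away, here is a sketch that maps it to the compiler extension instead, so the aliasing hint is kept. It assumes the __restrict spelling, which Visual C++ provides (see the "already available" list above) and which GCC and Clang also accept as far as I know. DEMO_RESTRICT and copy_ints are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* DEMO_RESTRICT is a hypothetical macro name used only in this sketch. */
#if defined(__cplusplus) || defined(_MSC_VER)
#  define DEMO_RESTRICT __restrict   /* compiler extension spelling */
#else
#  define DEMO_RESTRICT restrict     /* standard C99 keyword */
#endif

/* Copies n ints from s to d; the caller promises the ranges do not overlap. */
void copy_ints(int *DEMO_RESTRICT d, const int *DEMO_RESTRICT s, size_t n)
{
    assert(d != NULL && s != NULL);
    while (n-- > 0)
        *d++ = *s++;
}
```

Using a separate macro name instead of redefining restrict itself avoids surprises when a header that uses the real keyword is included later.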

Don't know if it is in there:

- hexadecimal floating-point literals like

float pi = 0x3.243F6A88p+0;

- C99 adds a few header files that are not included as part of the standard C++ library, though:
<complex.h>
<fenv.h>
<inttypes.h>
<stdbool.h>
<stdint.h>
<tgmath.h>
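To make the two items above a bit more concrete, here is a small standard C99 sketch that uses hexadecimal floating-point literals together with three of the listed headers. In a hexadecimal literal the digits after the point are hex fractions and the number after p is a power-of-two exponent. The names kThree, kPi and format_u32 are made up for this example; I have not checked which of these lines a given Visual Studio version accepts.

```c
#include <assert.h>
#include <inttypes.h>   /* PRIu32 format macro */
#include <stdbool.h>    /* bool, true, false */
#include <stdint.h>     /* uint32_t */
#include <stdio.h>

/* 0x1.8p+1 means (1 + 8/16) * 2^1 = 3.0 */
static const double kThree = 0x1.8p+1;

/* 0x3.243F6A88p+0 is pi to roughly ten significant digits; p+0 leaves it unscaled. */
static const double kPi = 0x3.243F6A88p+0;

/* Formats a fixed-width integer portably using the <inttypes.h> macros.
   Returns true if the whole number fit into the buffer. */
bool format_u32(char *buffer, size_t size, uint32_t value)
{
    assert(buffer != NULL);
    int written = snprintf(buffer, size, "%" PRIu32, value);
    return written >= 0 && (size_t)written < size;
}
```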

References

What was added in June 2013 blog
http://blogs.msdn.com/b/vcblog/archive/2013/06/27/what-s-new-for-visual-c-developers-in-vs2013-preview.aspx
C99 library support in Visual Studio 2013
http://blogs.msdn.com/b/vcblog/archive/2013/07/19/c99-library-support-in-visual-studio-2013.aspx
Incompatibilities between C99 and C++98
http://david.tribble.com/text/cdiffs.htm

CSE 190 - GPU Programming UCSD class Winter 2014

2014-01-02T11:40:00.005-08:00

GPU Programming
With the new console generation and the advances in PC hardware, compute support is becoming more important in games. The new course in 2014 will therefore start with compute and we will spend about one third of the whole course talking about how it is used on next-gen consoles and in next-gen games. We will also look into several case studies and discuss the feasibility of "re-factoring" existing game algorithms so that they run in compute. The emphasis here is on effects that are traditionally used for post-processing.

The remaining two thirds of the course will focus on the DirectX 11.2 graphics API and how it is used in games to create a rendering engine for a next-gen game. We will cover most of the fundamental concepts like the HLSL language, renderer design, lighting in games, how to generate shadows and we also discuss how transparency can be mimicked with techniques other than alpha blending.
The course will end with a survey of different real-time Global Illumination algorithms that are used in different types of games.

First Class
Overview
-- DirectX 11.2 Graphics
-- DirectX 11.2 Compute
-- Tools of the Trade - how to setup your development system
Introduction to DirectX 11.2 Compute
-- Advantages
-- Memory Model
-- Threading Model
-- DirectX 10.x support

Second Class
Simple Compute Case Studies
- PostFX Color Filters
- PostFX Parallel Reduction
- DirectX 11 Mandelbrot
- DirectX 10 Mandelbrot

Third Class
DirectCompute performance optimization
- Histogram optimization case study

Fourth Class
Direct3D 11.2 Graphics Pipeline Part 1
- Direct3D 9 vs. Direct3D 11
- Direct3D 11 vs. Direct3D 11.1
- Direct3D 11.1 vs. Direct3D 11.2
- Resources (typeless memory arrays)
- Resource Views
- Resources Access Intention
- State Objects
- Pipeline Stages
-- Input Assembler
-- Vertex Shader
-- Tessellation
-- Geometry Shader
-- Stream Out
-- Setup / Rasterizer
-- Pixel Shader
-- Output Merger
-- Video en- / decoder access

Fifth Class
Direct3D 11.2 Graphics Pipeline Part 2
-- HLSL
--- Keywords
--- Basic Data Types
--- Vector Data Types
--- Swizzling
--- Write Masks
--- Matrices
--- Type Casting
--- SamplerState
--- Texture Objects
--- Intrinsics
--- Flow Control
-- Case Study: implementing Blinn-Phong lighting with DirectX 11.2
--- Physically / Observational Lighting Models
--- Local / Global Lighting
--- Lighting Implementation
---- Ambient
---- Diffuse
---- Specular
---- Normal Mapping
---- Self-Shadowing
---- Point Light
---- Spot Light

Sixth Class
Physically Based Lighting
- Normalized Blinn-Phong Lighting Model
- Cook-Torrance Reflectance Model

Seventh Class
Deferred Lighting, AA
- Rendering Many Lights History
- Light Pre-Pass (LPP)
- LPP Implementation
- Efficient Light rendering on DX 9, 10, 11
- Balance Quality / Performance
- MSAA Implementation on DX 10.0, 10.1, XBOX 360, 11
Screen-Space Materials
- Skin

Eighth Class
Shadows
- The Shadow Map Basics
- “Attaching” a Shadow Map frustum around a view frustum
- Multi-Frustum Shadow Maps
- Cascaded Shadow Maps (CSM) : Splitting up the View
- CSM Challenges
- Cube Shadow Maps
- Softening the Penumbra
- Soft Shadow Maps

Ninth Class
Order-Independent Transparency
- Depth Peeling
- Reverse Depth Peeling
- Per-Pixel Linked Lists

Tenth Class
Global Illumination Algorithms in Games
- Requirement for Real-Time GI
- Ambient Cubes
- Diffuse Cube Mapping
- Screen-Space Ambient Occlusion
- Screen-Space Global Illumination
- Reflective Shadow Maps
- Splatting Indirect Illumination (SII)

Prerequisite
Each student should bring a notebook capable of DirectX 11.0 or higher, running Windows 7 or 8, to class. All the examples accompanying the class are built in C/C++ in Visual Studio 2013.

Visual Studio 2013 / Demo Skeleton Programming

2013-11-07T13:11:00.004-08:00

I updated my demo skeleton in the Google Code repository. It now uses Visual Studio 2013, which partially supports C99 and can therefore compile the code. I updated the compute shader code a bit and I upgraded Crinkler to version 1.4. The compute shader example now also compiles the shader into a header file and then Crinkler compresses this file as part of the data compression. Overall it now packs to 2,955 bytes.

https://code.google.com/p/graphicsdemoskeleton/

If you have fun with this code, let me know ... :-)

Call for a new Post-Processing Pipeline - KGC 2013 talk

2013-09-30T11:53:00.001-07:00

This is the text version of my talk at KGC 2013.

The main motivation for the talk was the idea of looking for fundamental changes that can bring a modern Post-Processing Pipeline to the next level.
Let's look first into the short history of Post-Processing Pipelines, where we are at the moment and where we might be going in the near future.

History

Probably one of the first Post-Processing Pipelines appeared in the DirectX SDK around 2004. It was a first attempt to implement HDR rendering. I believe from there on we called a collection of image space effects at the end of the rendering pipeline a Post-Processing Pipeline.

The idea was to re-use resources like render targets and data with as many image space effects as possible in a Post-Processing Pipeline.

A typical collection of screen-space effects was:

Tone-mapping + HDR rendering: the tone-mapper can be considered a dynamic contrast operator
Camera effects like Depth of Field with shaped Bokeh, Motion Blur, lens flare etc.
Full-screen color filters like contrast, saturation, color additions and multiplications etc.

One of the first coverages of a whole collection of effects in a Post-Processing Pipeline running on XBOX 360 / PS3 was done in [Engel2007].

Since then numerous new tone mapping operators were introduced [Day2012], new more advanced Depth of Field algorithms with shaped Bokeh were covered but there was no fundamental change to the concept of the pipeline.

Call for a new Post-Processing Pipeline

Let's start with the color space: RGB is not a good color space for a post-processing pipeline. It is well known that luminance variety is more important than color variety, so it makes sense to pick a color space that has luminance in one of the channels. With the 11:11:10 render targets it would be cool to store luminance in one of the 11 bit channels. Having luminance available in the pipeline without having to go through color conversions opens up many new possibilities, a few of which we will cover below.

Global tone mapping operators didn't work out well in practice. We looked at numerous engines in the last four years and a common decision by artists was to limit the luminance values by clamping them. One reason for this was that the textures didn't provide enough quality to survive a "light adaptation" without blowing out. Sometimes most of their resolution was in the low-end greyscale values and there just wasn't enough resolution to mimic light adaptations.

Another reason for this limitation was that the available resolution in the rendering pipeline with the RGB color space was not enough. A further reason is the fact that we limited ourselves to global tone mapping operators, because local tone mapping operators are considered too expensive.

A fixed global gamma adjustment at the end of the pipeline is partially doing "the same thing" as the tone mapping operator. It applies a contrast and might counteract the activities that the tone-mapper already does.

So the combination of a tone-mapping operator followed by the commonly used hardware gamma correction, both of which are global, is odd.

On a lighter note, a new Post-Processing Pipeline can add more stages. In the last couple of years, screen-space ambient occlusion, screen-space skin and screen-space reflections for dynamic objects became popular. Adding those to the Post-Processing Pipeline while trying to re-use existing resources needs to be considered in the architecture of the pipeline.

Last, one of the best targets for the new compute capabilities of GPUs is the Post-Processing Pipeline. Saving memory bandwidth by merging "render target blits" and re-factoring blur kernels for thread group shared memory (TGSM) are the most obvious design decisions, but they are not covered further in the following text.

Let's start by looking at an old Post-Processing Pipeline design. This is an overview I used in 2007:

A Post-Processing Pipeline Overview from 2007

A few notes on this pipeline. The tone mapping operation happens in two places: at the "final" stage for tone mapping the final result, and in the bright-pass filter for tone mapping the values before they can be considered "bright".

The "right" way to apply tone mapping independent of the tone mapping operator you choose is to convert into a color space that exposes luminance, apply the tone mapper to luminance and then convert back to RGB. In other words: you had to convert between RGB and a different color space back and forth twice.

In some pipelines, it was decided that this is a bit much and the tone mapper was applied to the RGB value directly. Tone mapping a RGB value with a luminance contrast operator led to "interesting" results.

Obviously this overview doesn't cover the latest Depth of Field effects with shaped Bokeh and separated near and far field Circle of Confusion calculations. Nevertheless, it already shows a large number of render-target to render-target blits that can be merged with compute support.

All modern rendering pipelines calculate color values in linear space. This means every texture that is loaded is converted into linear space by the hardware, then all the color operations like lighting, shadowing and post-processing are applied, and at the end the color values are converted back by applying the gamma curve.

This separate Gamma Control is located at the end of the pipeline, situated after tone mapping and color filters. This is because the GPU hardware can apply a global gamma correction to the image after everything is rendered.

The following paragraphs will cover some of the ideas we had to improve a Post-Processing Pipeline on a fundamental level. We implemented them into our Post-Processing Pipeline PixelPuzzle. Some of the research activities like finally replacing the "global tone mapping concept" with a better way of calculating contrast and color will have to wait for a future column.

Yxy Color Space

The first step to change a Post-Processing Pipeline in a fundamental way is to switch it to a different color space. Instead of running it in RGB we decided to use CIE Yxy through the whole pipeline. That means we convert RGB into Yxy at the beginning of the pipeline and convert back to RGB at the end. In-between all operations run on Yxy.

With CIE Yxy, the Y channel holds the luminance value. With a 11:11:10 render target, the Y channel will have 11 bits of resolution.

Instead of converting RGB to Yxy and back each time for the final tone mapping and the bright-pass stage, running the whole pipeline in Yxy means that this conversion might be only done once to Yxy and once or twice back to RGB.
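The two conversions at the ends of the pipeline can be sketched as follows. This is not the PixelPuzzle code; it assumes linear RGB with Rec. 709 primaries and a D65 white point, uses the usual matrices rounded to four digits, and falls back to the white point chromaticity for black pixels. The struct and function names are made up for this example.

```c
#include <assert.h>

typedef struct Rgb { float r, g, b; } Rgb;
typedef struct Yxy { float Y, x, y; } Yxy;   /* Y = luminance, x/y = chromaticity */

/* Assumed D65 white point chromaticity, used when the input is black. */
#define DEMO_WHITE_X 0.3127f
#define DEMO_WHITE_Y 0.3290f

/* Linear RGB -> CIE XYZ -> CIE Yxy. */
Yxy rgb_to_yxy(Rgb c)
{
    assert(c.r >= 0.0f && c.g >= 0.0f && c.b >= 0.0f);

    float X = 0.4124f * c.r + 0.3576f * c.g + 0.1805f * c.b;
    float Y = 0.2126f * c.r + 0.7152f * c.g + 0.0722f * c.b;
    float Z = 0.0193f * c.r + 0.1192f * c.g + 0.9505f * c.b;
    float sum = X + Y + Z;

    Yxy out = { Y, DEMO_WHITE_X, DEMO_WHITE_Y };
    if (sum > 0.0f) {
        out.x = X / sum;
        out.y = Y / sum;
    }
    return out;
}

/* CIE Yxy -> CIE XYZ -> linear RGB. */
Rgb yxy_to_rgb(Yxy c)
{
    Rgb out = { 0.0f, 0.0f, 0.0f };
    if (c.y <= 0.0f)
        return out;                          /* no chromaticity: treat as black */

    float X = c.x * c.Y / c.y;
    float Z = (1.0f - c.x - c.y) * c.Y / c.y;

    out.r =  3.2406f * X - 1.5372f * c.Y - 0.4986f * Z;
    out.g = -0.9689f * X + 1.8758f * c.Y + 0.0415f * Z;
    out.b =  0.0557f * X - 0.2040f * c.Y + 1.0570f * Z;
    return out;
}
```

Everything between those two calls only ever touches Y when it needs luminance, which is the whole point of the exercise.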

Tone mapping then still happens with the Y channel in the same way it happened before. Confetti's PostFX pipeline offers eight different tone mapping operators and each of them works well in this setup.
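As a sketch of what "tone mapping with the Y channel" looks like, here is the simple Reinhard curve L / (1 + L) applied to a Yxy pixel. The operator is only a stand-in and not necessarily one of the eight mentioned above; the exposure parameter and all names are assumptions of this example.

```c
#include <assert.h>

typedef struct Yxy { float Y, x, y; } Yxy;   /* Y = luminance, x/y = chromaticity */

/* Simple Reinhard curve, used here only as a stand-in operator. */
float tonemap_reinhard(float luminance)
{
    assert(luminance >= 0.0f);
    return luminance / (1.0f + luminance);
}

/* Tone maps a pixel that is already in Yxy: only Y changes, x and y pass through. */
Yxy tonemap_yxy(Yxy hdr, float exposure)
{
    Yxy ldr = hdr;
    ldr.Y = tonemap_reinhard(hdr.Y * exposure);
    return ldr;
}
```

No color space conversion shows up inside the operator anymore, because the pixel already arrives with its luminance separated.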

Now one side effect of using Yxy is also that you can run the bright-pass filter as a one channel operation, which saves some cycles on modern scalar GPUs.
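A one channel bright-pass could look like the following sketch. It assumes the simplest possible filter (subtract a threshold, clamp at zero) on a buffer that holds only the Y values; the actual filter in a production pipeline will likely be more elaborate, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Keeps only the luminance above the threshold; works on the Y channel alone. */
void bright_pass_luminance(const float *luminanceIn, float *brightOut,
                           size_t pixelCount, float threshold)
{
    assert(luminanceIn != NULL && brightOut != NULL);

    for (size_t i = 0; i < pixelCount; ++i) {
        float excess = luminanceIn[i] - threshold;
        brightOut[i] = excess > 0.0f ? excess : 0.0f;
    }
}
```

In RGB the same filter would first have to derive a luminance from three channels and then scale all three of them.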

One other thing that Yxy allows is to treat the occlusion term in Screen-Space Ambient Occlusion as part of the Y channel. So you can mix in this term and use it in interesting ways. Similar ideas apply to any other occlusion term that your pipeline might be able to use.

The choice of CIE Yxy as the color space was somewhat arbitrary. In 2007 I evaluated several different color spaces and we ended up with Yxy at the time. Here is my old table:

Pick a Color Space Table from 2007

Compared to CIE Yxy, HSV doesn't make it easy to run a blur filter kernel. The target was to leave the pipeline as unchanged as possible when picking a color space. So with Yxy, all the common Depth of Field algorithms and any other blur kernel run unchanged. HSV conversions also seem to be more expensive compared to RGB -> CIE XYZ -> CIE Yxy and vice versa.

There might be other color spaces that are similarly well tailored to the task.

Dynamic Local Gamma

As mentioned above, the fact that we apply a tone mapping operator and then later on a global gamma operator appears to be a bit odd. Here is what the hardware is supposed to do when it applies the gamma "correction".

Gamma Correction

The main take-away from this curve is that the same curve is applied to every pixel on screen. In other words: this curve shows an emphasis on dark areas independently of the pixel being very bright or very dark.

Whatever curve the tone-mapper will apply, the gamma correction might be counteracting it.

It appears to be a better idea to move the gamma correction closer to the tone mapper, making it part of the tone mapper and at the same time apply gamma locally per pixel.

In fact, gamma correction is considered to depend on the light adaptation level of the human visual system. The "gamma correction" that is applied by the eye changes the perceived luminance based on the eye's adaptation level [Bartleson 1967] [Kwon 2011].

When the eye is adapted to dark lighting conditions, the exponent for the gamma correction is supposed to increase. If the eye is adapted to bright lighting conditions, the exponent for the gamma correction is supposed to decrease. This is shown in the following image taken from [Bartleson 1967]:

Changes in Relative Brightness Contrast [Bartleson 1967]

A local gamma value can vary with the eye's adaptation level. The equation that adjusts the gamma correction following the current adaptation level of the eye can be found in [Kwon 2011].

$$\gamma_v = 0.444 + 0.045 \ln(L_{an} + 0.6034)$$

For this presentation, this equation was taken from the paper by Kwon et al. Depending on the type of game there is an opportunity to build your own local gamma operator.

The input luminance value is generated by the tone mapping operator and then stored in the Y channel of the Yxy color space:

$$Y_{Yxy} = L^{\gamma_v}$$

$\gamma_v$ changes based on the luminance value of the current pixel. That means each pixel's luminance value might be gamma corrected with a different exponent. For the equation above, the exponent value is in the range of 0.421 to 0.465.

Applied Gamma Curve per-pixel based on luminance of pixel

- Eye's adaptation low -> blue curve
- Eye's adaptation high -> green curve

$L^{\gamma_v}$ works with any tone mapping operator. L is the luminance value coming from the tone mapping operator.

With a dynamic local gamma value, the dynamic lighting and shadowing information that is introduced in the pipeline will be considered for the gamma correction. The changes when going from bright areas to dark areas appear more natural. Textures hold up better to the challenges of light adaptation. Overall, lights and shadows look better.

Depth of Field

As a proof-of-concept of the usage of the Yxy color space and the local dynamic gamma correction, this section shows screenshots of a modern Depth of Field implementation with separated near and far field calculations and a shaped Bokeh, implemented in compute.

Producing an image through a lens leads to a "spot" that will vary in size depending on the position of the original point in the scene:

Circle of Confusion (image taken from Wikipedia)

The Depth of Field is the region where the CoC is less than the resolution of the human eye (or in our case the resolution of our display medium). The equation for calculating the CoC [Potmesil1981] is:

Following the variables in this equation, Confetti demonstrated in a demo at GDC 2011 [Alling2011] the following controls:

F-stop - ratio of focal length to aperture size
Focal length – distance from lens to image in focus
Focus distance – distance to plane in focus
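Here is a minimal sketch of how these three controls can drive a CoC calculation. It assumes the commonly used thin-lens form, with the aperture diameter derived as focal length divided by F-stop, and it may differ in detail from the equation in [Potmesil1981]. The sign is chosen so that the result is positive in front of the focus plane and negative behind it. The function name and parameter names are made up for this example, and all distances are assumed to be in the same unit.

```c
#include <assert.h>

/* Signed thin-lens circle of confusion diameter (assumed form, see lead-in).
   > 0: object is closer than the focus plane (near field)
   < 0: object is farther than the focus plane (far field) */
float circle_of_confusion(float focalLength, float fStop,
                          float focusDistance, float objectDistance)
{
    assert(focalLength > 0.0f && fStop > 0.0f);
    assert(focusDistance > focalLength && objectDistance > 0.0f);

    float aperture = focalLength / fStop;      /* aperture diameter */
    return aperture * focalLength * (focusDistance - objectDistance)
         / (objectDistance * (focusDistance - focalLength));
}
```

For a 50 mm lens at F-stop 2 focused at 2 m, an object at 1 m gets a positive value and an object at 4 m a negative one.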

Because the CoC is negative for far field and positive for near field calculations, separate results are commonly generated for the near field and far field of the effect [Sousa13].
Usually the calculation of the CoC is done for each pixel in a down-sampled buffer or texture. Then the near and far field results are generated. Then, first, the far and focus field results are combined and then this result is combined with the near field, based on a near field coverage value. The following screenshots show the result of those steps, with the first screenshot showing the near and far field calculations:

Red = max CoC(near field CoC)

Green = min CoC(far field CoC)

Here is a screenshot of the far field result in Yxy:

Far field result in Yxy

Here is a screenshot of the near field result in Yxy:

Near field result in Yxy

Here is a screenshot of the resulting image after it was converted back to RGB:

Resulting Image in RGB

Conclusion

A modern Post-Processing Pipeline can benefit greatly from being run in a color space that offers a separable luminance channel. This opens up new opportunities for an efficient implementation of many new effects.

With the long-term goal of removing any global tone mapping from the pipeline, a dynamic local gamma control can offer more intelligent gamma control that is per-pixel and offers a stronger contrast of bright and dark areas, considering all the dynamic additions in the pipeline.

Any future development in the area of Post-Processing Pipelines can be focused on a more intelligent luminance and color harmonization.

References
[Alling2011] Michael Alling, "Post-Processing Pipeline", http://www.conffx.com/GDC2011.zip

[Bartleson 1967] C. J. Bartleson and E. J. Breneman, “Brightness function: Effects of adaptation,” J. Opt. Soc. Am., vol. 57, pp. 953-957, 1967.
[Day2012] Mike Day, “An efficient and user-friendly tone mapping operator”, http://www.insomniacgames.com/mike-day-an-efficient-and-user-friendly-tone-mapping-operator/
[Engel2007] Wolfgang Engel, “Post-Processing Pipeline”, GDC 2007, http://www.coretechniques.info/index_2007.html
[Kwon 2011] Hyuk-Ju Kwon, Sung-Hak Lee, Seok-Min Chae, Kyu-Ik Sohng, “Tone Mapping Algorithm for Luminance Separated HDR Rendering Based on Visual Brightness Function”, online at http://world-comp.org/p2012/IPC3874.pdf
[Potmesil1981] Potmesil M., Chakravarty I. “Synthetic Image Generation with a Lens and Aperture Camera Model”, 1981
[Reinhard] Erik Reinhard, Michael Stark, Peter Shirley, James Ferwerda, "Photographic Tone Reproduction for Digital Images", http://www.cs.utah.edu/~reinhard/cdrom/
[Sousa13] Tiago Sousa, "CryEngine 3 Graphics Gems", SIGGRAPH 2013, http://www.crytek.com/cryengine/presentations/cryengine-3-graphic-gems

KGC 2013

2013-09-13T10:32:00.003-07:00

I will be a speaker at the Korean Game Developer Conference this year. This is my third time and I am very much enjoying it.
This year I want to talk about building a next-gen Post-Processing Pipeline. Most people haven't changed their PostFX pipeline algorithms in 6 or 7 years ( ... no, re-writing it in compute doesn't count ... and replacing your Reinhard operator with an approximated Hable operator (check out Insomniac's website) doesn't count either :-) ).

Please come by and say hi if you are around.

TressFX - Crystal Dynamics and AMD cover TressFX at SIGGRAPH

2013-07-29T11:48:00.004-07:00

There were more talks about Confetti's work on TressFX at SIGGRAPH. One of them was the talk by Jason Lacroix: "Adding More Life to Your Characters With TressFX".

Activision's head demo uses TressFX as well: "Digital Ira: High-Resolution Facial Performance Playback".

If you are a registered developer and you need XBOX One or PS4 implementations, send me an e-mail.

SIGGRAPH 2013

2013-07-25T11:07:00.002-07:00

I would like to highlight the talk "Crafting a Next-Gen Material Pipeline for The Order: 1886":

http://blog.selfshadow.com/publications/s2013-shading-course

The 3D Fabric Scanner is a fantastic idea and the results are awesome. Those are next-gen characters. Great work!