Sunday, May 31, 2015

Multi-GPU Game Engine

Many high-end rendering solutions, for example battlefield simulations, can now use hardware setups with multiple consumer GPUs. The idea is to split the computational work across 4 - 8 GPUs to increase the level of realism as much as possible.
Now with more modern APIs like DirectX 12 and probably Vulkan, and before that CUDA, the rendering pipeline can be split up in the following way:
- GPU0 - fills up the G-Buffer after a Z pre-pass
- GPU1 - Renders Deferred Lights and Shadows
- GPU2 - Renders Particles and Vegetation
- GPU3 - Renders Screen-Space Materials like skin etc. and PostFX

Now you can take the result of GPU0 and feed it to GPU1, then feed that to GPU2, and so on. All of this runs in parallel but introduces two or three frames of lag (depending on how you light Particles and Vegetation). As long as the system renders at 60 fps or 120 fps this will not be very noticeable (obviously one of the targets is a high framerate to make animations look smooth, together with 4K resolution rendering). GPU4 and higher can work on Physics, AI and other things.
There is also the opportunity to spread G-Buffer rendering over several GPUs: one GPU does the Z pre-pass, another fills in diffuse, normal and probably some geometry data to identify different objects later or store their edges, and another GPU fills in the terrain data. Vegetation can be rendered on a dedicated GPU, etc.
On the CPU side the rule of thumb is that at least 2 cores are needed per GPU; it is probably better to go for 3 or 4. So a four-GPU machine should have 8 - 16 CPU cores and an eight-GPU machine 16 - 32 CPU cores, which might be split between several physical CPUs. We need at least 2x as much CPU RAM as the GPUs have RAM in total, so if four GPUs each have 2 GB, we need at least 16 GB of RAM; with eight GPUs, we need at least 32 GB of RAM, and so on.
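To make the frame lag of the GPU-to-GPU hand-off concrete, here is a minimal toy model in Python. It assumes every stage takes exactly one frame time and that results are handed over at frame boundaries; the function names are made up for this illustration, and real transfer costs between cards are ignored.

```python
# Toy model of a frame pipeline spread over several GPUs.
# Assumption: every stage takes exactly one "tick" (one frame time).

def simulate(num_frames, num_stages):
    """Return one row per tick; row[gpu] is the frame that GPU works on (or None)."""
    timeline = []
    for tick in range(num_frames + num_stages - 1):
        row = []
        for gpu in range(num_stages):
            frame = tick - gpu  # frame f reaches GPU g at tick f + g
            row.append(frame if 0 <= frame < num_frames else None)
        timeline.append(row)
    return timeline


def lag_in_frames(num_stages):
    """How many frames the last GPU is behind the first GPU once the pipe is full."""
    return num_stages - 1


if __name__ == "__main__":
    for tick, row in enumerate(simulate(num_frames=6, num_stages=4)):
        print(tick, row)
    print("lag with 4 serial stages:", lag_in_frames(4))
    print("lag with 3 serial stages:", lag_in_frames(3))
```

In this model four strictly serial stages give three frames of lag. If particles and vegetation do not have to wait for the lighting result, the chain is effectively three stages deep and the lag drops to two frames.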
A 4K resolution consists of 3840 × 2160 pixels. With four render targets, each 32 bits per pixel (8:8:8:8 or 11:11:10), it occupies roughly 126.56 MB. This number goes up with 4x or 8x MSAA and maybe super-sampling. It is probably safe to assume that the G-Buffer might occupy between 500 MB and 1 GB.
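The arithmetic behind these numbers can be checked with a few lines of Python. This is only a sketch: it counts the four colour targets, leaves out the depth buffer, and assumes MSAA simply multiplies storage by the sample count.

```python
MIB = 1024 * 1024


def gbuffer_bytes(width, height, render_targets=4, bytes_per_pixel=4, msaa_samples=1):
    """Raw storage for the colour targets of a G-Buffer (depth buffer not included)."""
    return width * height * render_targets * bytes_per_pixel * msaa_samples


for samples in (1, 4, 8):
    size = gbuffer_bytes(3840, 2160, msaa_samples=samples)
    print(f"{samples}x: {size / MIB:.2f} MiB")
# prints:
# 1x: 126.56 MiB
# 4x: 506.25 MiB
# 8x: 1012.50 MiB
```

The 4x and 8x MSAA results are where the 500 MB to 1 GB estimate comes from.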
Achieving a frame time of 8 - 16 ms means that even a high-end GPU will be quite busy filling a G-Buffer of this size. So splitting this work between two GPUs might make sense.
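As a rough sanity check, the frame budget and a lower bound on the G-Buffer write traffic can be worked out as follows. The sketch assumes each pixel of the four 32-bit targets is written exactly once per frame, with no overdraw, no MSAA, no depth traffic and no reads, so real numbers will be higher.

```python
GBUFFER_BYTES = 3840 * 2160 * 4 * 4  # four 32-bit render targets at 4K, no MSAA


def frame_budget_ms(fps):
    """Time available per frame in milliseconds."""
    return 1000.0 / fps


def min_write_bandwidth_gb_per_s(bytes_per_frame, fps):
    """Lower bound: every byte of the G-Buffer is written once per frame."""
    return bytes_per_frame * fps / 1e9


for fps in (60, 120):
    budget = frame_budget_ms(fps)
    bandwidth = min_write_bandwidth_gb_per_s(GBUFFER_BYTES, fps)
    print(f"{fps} fps: {budget:.2f} ms per frame, at least {bandwidth:.2f} GB/s of writes")
```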
A high-end PostFX pipeline now takes < 5 ms on medium-class GPUs, but dedicating a whole high-end GPU means we can finally switch on the movie settings :-)
A GPU particle system can easily saturate a GPU for 16 ms ... especially if it is not rendering at quarter-size resolution.
For Lights and shadows it depends on the number of lights that should be applied. Caching all the shadow data in partially resident textures or cube maps or any other shadow map technique will hit the memory budget of this card substantially.

Note: I wrote this more than two years ago. At the time a G-Buffer was a valid solution for designing a rendering system. Now, with high-res displays, it no longer is.

Wednesday, May 27, 2015

V Buffer - Deferred Lighting Re-Thought

After eight years I would like to go back and re-design the existing rendering systems, so that they can run more efficiently on high-resolution devices and display more lights with attached shadows.

Let's first see where we are: the Light Pre-Pass was introduced in March 2008 on this blog. At that point I had already had it running in one R* game for a while. It eventually shipped in a large number of games, also outside of R*. The S.T.A.L.K.E.R. series and the games developed by Naughty Dog had a similar approach at the time. Since then a number of modifications have been proposed.
One modification was to calculate lighting by tiling the G-Buffer, sorting lights into those tiles and then executing each tile with its lights. Johan Andersson covered a practical implementation in "DirectX 11 rendering in Battlefield 3". Before Tiled-Deferred, lights were additively blended into a buffer, consuming memory bandwidth with each blit. The Tiled-Deferred approach reduced memory bandwidth consumption substantially by resolving all the lights in one tile.
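The light-sorting step can be sketched on the CPU as follows. This is a simplified illustration, not the implementation from the Battlefield 3 talk: it assumes lights are already reduced to screen-space bounding circles, uses a 16-pixel tile size picked for the example, and skips the per-tile depth bounds that a real compute-shader version would also test against.

```python
import math

TILE = 16  # tile size in pixels (assumed for this example)


def bin_lights(width, height, lights, tile=TILE):
    """Sort lights into screen tiles.

    lights: list of (cx, cy, radius) screen-space bounding circles in pixels.
    Returns {(tile_x, tile_y): [light indices]}; tiles without lights are absent.
    """
    tiles_x = math.ceil(width / tile)
    tiles_y = math.ceil(height / tile)
    bins = {}
    for index, (cx, cy, radius) in enumerate(lights):
        # Conservative test: use the circle's bounding square, clamped to the screen.
        x0 = max(int((cx - radius) // tile), 0)
        x1 = min(int((cx + radius) // tile), tiles_x - 1)
        y0 = max(int((cy - radius) // tile), 0)
        y1 = min(int((cy + radius) // tile), tiles_y - 1)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins.setdefault((tx, ty), []).append(index)
    return bins


if __name__ == "__main__":
    lights = [(8, 8, 4), (16, 16, 4)]
    for tile_xy, indices in sorted(bin_lights(64, 64, lights).items()):
        print(tile_xy, indices)
```

Each tile then loops over its own short light list and writes the accumulated result once, instead of blending every light into the buffer separately.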
The drawback of this approach is the higher minimum run-time cost. Sorting the lights into the tiles raised the "resting" workload even when only a few lights were rendered. Compared to the older approaches it didn't break even until one rendered a few dozen lights. Additionally as soon as lights had to be drawn with shadows, the memory bandwidth savings were negligible.
Newer approaches like "Clustered Deferred and Forward Shading" by Ola Olsson et al. started solving the "light overdraw" problem in even more efficient ways. A practical implementation is shown in an example program on Emil Persson's website.
Because transparency solutions with all the approaches mentioned above are inconsistent with the way opaque objects are handled, a group of people wished to go back to forward rendering. Takahiro Harada described and refined an approach that he called Forward+. The tile-based handling of light sources was similar to the Tiled-Deferred approach. The advantage of having a consistent way of lighting transparent and opaque objects was bought by having to re-submit all potentially visible geometry several times.

Filling a G-Buffer, or re-submitting geometry in the case of Forward+, is expensive. For the Deferred Lighting implementations, the G-Buffer fill was also the stage where visibility of geometry was solved (there is also the option of doing a Z Pre-Pass, which means geometry is submitted at least one more time).
With modern 4K displays and high-res devices like tablets and smartphones, a G-Buffer is not a feasible solution anymore. When the Light Pre-Pass was developed a 1280x720 resolution was considered state of the art. Today 1080p is considered the minimum resolution, iOS and Android devices have resolutions several times this size and even modern PC monitors can have more than 4K resolution.
MSAA increases the size, and therefore the cost, manifold.
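To put the growth in perspective, the following sketch compares G-Buffer footprints across resolutions. The layout of four 32-bit render targets is an assumption made for the example, and MSAA is modelled as a plain multiplier on storage.

```python
MIB = 1024 * 1024
RESOLUTIONS = {"720p": (1280, 720), "1080p": (1920, 1080), "4K": (3840, 2160)}


def gbuffer_mib(width, height, msaa_samples=1, render_targets=4, bytes_per_pixel=4):
    """G-Buffer colour storage in MiB for the assumed layout."""
    return width * height * render_targets * bytes_per_pixel * msaa_samples / MIB


base = gbuffer_mib(*RESOLUTIONS["720p"])
for name, (width, height) in RESOLUTIONS.items():
    plain = gbuffer_mib(width, height)
    msaa4 = gbuffer_mib(width, height, msaa_samples=4)
    print(f"{name}: {plain:.2f} MiB ({plain / base:.2f}x of 720p), 4x MSAA: {msaa4:.2f} MiB")
```

Going from 720p to 4K multiplies the pixel count by nine before MSAA is even considered.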

Instead of rendering geometry into three or four render targets with overdraw (or re-submitting after the Z pre-pass), we need to find a way to store visibility data separately, in a much smaller buffer and in a more efficient way.
In other words, if we could capture the full-screen visibility of geometry in as small a footprint as possible, we could significantly reduce the cost of geometry submission and pixel overdraw afterwards.

A first idea on how this could be done is described in the article "The Visibility Buffer: A Cache-Friendly Approach to Deferred Shading" by Christopher A. Burns et al. The article outlines the idea of storing per-triangle visibility data in a visibility buffer.
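As a rough illustration of what such per-triangle data could look like, the sketch below packs a draw-call ID and a triangle ID into a single 32-bit value per pixel. The 9/23 bit split and the function names are assumptions made for this example, not the layout from the article.

```python
TRIANGLE_BITS = 23             # assumed: up to ~8.4 million triangles per draw call
DRAW_BITS = 32 - TRIANGLE_BITS  # assumed: up to 512 draw calls
TRIANGLE_MASK = (1 << TRIANGLE_BITS) - 1


def pack_visibility(draw_id, triangle_id):
    """Pack both IDs into one unsigned 32-bit visibility-buffer texel."""
    if not 0 <= draw_id < (1 << DRAW_BITS):
        raise ValueError("draw_id does not fit into the assumed bit budget")
    if not 0 <= triangle_id <= TRIANGLE_MASK:
        raise ValueError("triangle_id does not fit into the assumed bit budget")
    return (draw_id << TRIANGLE_BITS) | triangle_id


def unpack_visibility(texel):
    """Recover (draw_id, triangle_id) from a packed texel."""
    return texel >> TRIANGLE_BITS, texel & TRIANGLE_MASK


MIB = 1024 * 1024
visibility_mib = 3840 * 2160 * 4 / MIB   # one 32-bit target at 4K
gbuffer_mib = 3840 * 2160 * 4 * 4 / MIB  # four 32-bit targets at 4K

if __name__ == "__main__":
    texel = pack_visibility(draw_id=3, triangle_id=1000)
    print(texel, unpack_visibility(texel))
    print(f"visibility buffer: {visibility_mib:.2f} MiB, G-Buffer: {gbuffer_mib:.2f} MiB")
```

With one 32-bit target instead of four, the per-pixel footprint is a quarter of the G-Buffer layout assumed here; the surface attributes are then fetched from the vertex data during shading instead of being written out per pixel.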