Friday, July 3, 2009

MSAA on the PS3 with Light Pre-Pass on the SPU

In the previous "MSAA on the PS3" thread Matt Swoboda jumped in and mentioned that they implemented MSAA on the SPU in the Phyre Engine. I knew that they implemented the Light Pre-Pass on the SPU but I completely forgot that they also had a solution to do MSAA on the SPU.
You can find the presentation "Deferred Lighting and Post Processing on PLAYSTATION®" here.
Because the SPU can read and write individual samples, they can achieve functionality similar to the per-sample frequency of DirectX 10.1-class graphics hardware, where each sample can be treated separately. They can therefore calculate the lighting for each sample and write the result into the corresponding sample of the light buffer.
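
To make the per-sample idea concrete, here is a minimal CPU-side sketch in C++. It is not the Phyre Engine code: the Sample struct, the shade and lightPerSample functions, and the single directional Lambert light are all hypothetical stand-ins chosen for illustration. The only point is that the loop runs once per sample rather than once per pixel, and that each result lands in its own slot of the light buffer.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-sample G-buffer entry: one unit normal per MSAA sample.
struct Sample {
    float nx, ny, nz;
};

// Lambert term for one directional light (stand-in for the real lighting).
// (lx, ly, lz) is assumed to be normalized and to point towards the light.
float shade(const Sample& s, float lx, float ly, float lz) {
    float ndotl = s.nx * lx + s.ny * ly + s.nz * lz;
    return ndotl > 0.0f ? ndotl : 0.0f;
}

// Light every sample on its own and store one result per sample,
// instead of lighting once per pixel and sharing the result.
std::vector<float> lightPerSample(const std::vector<Sample>& gbuffer,
                                  float lx, float ly, float lz) {
    std::vector<float> lightBuffer(gbuffer.size());
    for (std::size_t i = 0; i < gbuffer.size(); ++i) {
        lightBuffer[i] = shade(gbuffer[i], lx, ly, lz);
    }
    return lightBuffer;
}
```

With two samples of one edge pixel facing opposite directions, the two light-buffer entries come out different, which is exactly the information a per-pixel evaluation would lose.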


Pat Wilson said...

Thanks, Wolfgang. This is going to be very-very useful, very-very shortly.

Now if I can only find out where the atomic operations intrinsics are defined, I'll be golden!

Anonymous said...

You don't need SPUs to 'emulate' DX10.1 MSAA capabilities, given that RSX can read back compressed subsamples at full speed.

Wolfgang Engel said...

This is interesting. Can the RSX write per sample? I guess you just let the pixel shader run for each sample then?

Wolfgang Engel said...

... I definitely need a PS3 DevKit very soon ... just didn't have time to take care of this. I would love to try a proper MSAA implementation on the hardware with a stencil buffer filled with edge data based on centroid sampling.

Matt said...

On PS3 an MSAA surface is more or less just a surface with scaled dimensions (2x width for MSAA 2x, 2x width and height for MSAA 4x).
So yes, you can use the "actual-size" (2x width, 2x height) surface as a texture and read all the samples individually - same as we do on SPU. No problems there at all, custom MSAA resolves and so on are common practice.
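
As a rough illustration of the "surface with scaled dimensions" view, the C++ sketch below addresses individual samples in an actual-size surface and does a simple box-filter resolve. The MsaaSurface type, its index function, and resolvePixel are hypothetical, and the assumption that the samples of one pixel sit in adjacent texels of a linear surface is mine; a real surface may be tiled or swizzled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical linear "actual-size" view of an MSAA surface.
// 2x MSAA -> sx = 2, sy = 1; 4x MSAA -> sx = 2, sy = 2.
struct MsaaSurface {
    std::size_t width;   // logical width in pixels
    std::size_t height;  // logical height in pixels
    std::size_t sx;      // samples per pixel horizontally
    std::size_t sy;      // samples per pixel vertically
    std::vector<float> texels;  // (width * sx) x (height * sy) values

    MsaaSurface(std::size_t w, std::size_t h,
                std::size_t samplesX, std::size_t samplesY)
        : width(w), height(h), sx(samplesX), sy(samplesY),
          texels(w * samplesX * h * samplesY, 0.0f) {}

    // Texel index of sample s of pixel (x, y) in the actual-size surface.
    std::size_t index(std::size_t x, std::size_t y, std::size_t s) const {
        assert(x < width && y < height && s < sx * sy);
        std::size_t ax = x * sx + s % sx;  // column in the actual-size surface
        std::size_t ay = y * sy + s / sx;  // row in the actual-size surface
        return ay * (width * sx) + ax;
    }
};

// A simple custom resolve: box-filter all samples of one pixel.
float resolvePixel(const MsaaSurface& surf, std::size_t x, std::size_t y) {
    const std::size_t count = surf.sx * surf.sy;
    float sum = 0.0f;
    for (std::size_t s = 0; s < count; ++s) {
        sum += surf.texels[surf.index(x, y, s)];
    }
    return sum / static_cast<float>(count);
}
```

Any other filter can be swapped in for the box average, since every sample is reachable on its own through index.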

Writing-wise - you can sort of write to the "samples" individually by writing to a very large "actual-size" surface, but that can be prohibitive in terms of performance.
It's a little bit more complicated than that. There are rules about mixing and matching MSAA and non-MSAA surfaces, which can make using it for lighting quite ugly.

The problem with using the stencil for edge information is that as the hardware doesn't have an early Z unit you're limited to ZCull/SCull tile granularity. So you will definitely end up processing more than you wanted to. And then there's the cost of generating that stencil in the first place.

I should probably stop flooding this post's comments now, so if anyone wants more information or help with all this drop me a mail or catch me on PS3 devnet. :)

James Stanard said...

You can not only read each sample individually, you can also write to each sample individually using a sample coverage mask. In one pass, you set the mask to write to all left samples while reading only left samples. Then in the second pass, you shade all right samples. In my implementation, doing true MSAA deferred lighting was about 10% more expensive than doing it all in one pass (averaging the shaded samples), and provided noticeably better antialiasing (precise Quincunx filtering), especially around sky silhouettes.

Treating the surface as a non-MSAA buffer, i.e. doubling the pixels rather than the samples, could break ZCull, which is why I would prefer to keep treating the buffer as MSAA and write the separate samples in separate passes.

This approach could extend to 4xMSAA as well if you can pay the cost.
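
The pass structure described above can be sketched on the CPU in C++. This only simulates the idea and is not RSX code: Pixel, shadePass, lightStandIn, and shadeInTwoPasses are hypothetical names, the mask is modelled as a plain bit field with one bit per sample, and the "lighting" is a placeholder multiply.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kSamples = 2;  // assumed 2x MSAA: sample 0 = left, sample 1 = right

struct Pixel {
    float s[kSamples];  // one value per sample
};

// One pass over the target. Only samples whose bit is set in sampleMask are
// written, which mimics what a sample coverage mask does.
template <class ShadeFn>
void shadePass(std::vector<Pixel>& target, std::uint32_t sampleMask,
               ShadeFn shadeSample) {
    for (std::size_t i = 0; i < target.size(); ++i) {
        for (int k = 0; k < kSamples; ++k) {
            if ((sampleMask >> k) & 1u) {
                target[i].s[k] = shadeSample(i, k);
            }
        }
    }
}

// Placeholder for the real lighting computation.
float lightStandIn(float gbufferValue) { return gbufferValue * 0.5f; }

// One pass per sample: pass k reads only sample k and writes only sample k.
std::vector<Pixel> shadeInTwoPasses(const std::vector<Pixel>& gbuffer) {
    std::vector<Pixel> lightBuffer(gbuffer.size(), Pixel{});
    for (int k = 0; k < kSamples; ++k) {
        shadePass(lightBuffer, 1u << k,
                  [&](std::size_t i, int sample) {
                      return lightStandIn(gbuffer[i].s[sample]);
                  });
    }
    return lightBuffer;
}
```

Raising kSamples to 4 gives the four-pass variant mentioned for 4xMSAA, at the cost of two more passes over the buffer.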

Wolfgang Engel said...

James, if you are interested, please send me your e-mail address. I would like to follow up with you regarding this.

David Farrell said...

We did some research at Nihilistic, and implemented an SPU/RSX renderer that tried both the light prepass and deferred lighting methods. We were using 2X MSAA and HDR, and FWIW I found the deferred lighting solution was more efficient.

The light prepass had more overhead because the SPUs would write out two LogLuv encoded values for diffuse and specular. In the final forward pass, the RSX would then have to read both the diffuse and specular values, decode them from LogLuv, do the final shading, then re-encode that value back to LogLuv and write to a 2X MSAA framebuffer. Finally, there was the MSAA resolve pass that would read both the 2X MSAA samples, decode those from LogLuv, blend them and write them out as HDR values.
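
The reason the resolve has to decode before it blends is that a logarithmic encoding is not linear, so averaging encoded values gives a different (darker) result than averaging the HDR values themselves. The C++ sketch below shows this with a deliberately simplified stand-in: encodeLog and decodeLog are hypothetical single-channel functions, not the real LogLuv layout, and both resolve functions are named only for this example.

```cpp
#include <cmath>

// Simplified stand-in for a log-luminance encoding (NOT real LogLuv):
// one HDR luminance value is stored as log2(hdr + 1).
float encodeLog(float hdr) { return std::log2(hdr + 1.0f); }
float decodeLog(float encoded) { return std::exp2(encoded) - 1.0f; }

// Resolve as described in the comment: decode both samples, then blend.
float resolveDecodeThenBlend(float encoded0, float encoded1) {
    return 0.5f * (decodeLog(encoded0) + decodeLog(encoded1));
}

// What a plain hardware-style average of the encoded values would amount to.
float resolveBlendThenDecode(float encoded0, float encoded1) {
    return decodeLog(0.5f * (encoded0 + encoded1));
}
```

For two samples with HDR values 0 and 15, decoding first gives 7.5, while blending the encoded values first gives about 3, so the extra decode work in the resolve pass cannot simply be skipped.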

The deferred lighting tended to be more efficient because the SPUs could just read the geometry attributes directly, perform the lighting on SPU in floating point to support HDR, blend the two 2X MSAA samples, and write out the final resolved HDR value.

If you take MSAA or HDR out of the equation, then the tradeoffs change. Also, a lot of choices depend on how much memory you can spare in either main or local memory.

Wolfgang Engel said...

"The light prepass had more overhead because the SPUs would write out two LogLuv encoded values for diffuse and specular."
I believe you mixed something up here. Light Pre-Pass uses only one light buffer that stores diffuse in rgb and specular in alpha. You don't have to read two textures with the RSX in that case. Crytek calls this deferred lighting because they think it would make more sense to keep it close to the term Deferred shading. I think Light Pre-Pass is better because you do the lighting as a pre-pass before the scene rendering ...
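
A single-light-buffer layout of this kind could look roughly like the C++ sketch below. The names LightTexel, Vec3, dot, and accumulateLight are hypothetical, and the exact terms (N.L times attenuation times light colour in rgb, and a Blinn-Phong style N.H power scaled by the same N.L and attenuation in alpha) are one common formulation assumed here for illustration, not a statement of what any particular engine stores.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 {
    float x, y, z;
};

float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// One texel of a single four-channel light buffer:
// diffuse lighting in rgb, a monochrome specular term in alpha.
struct LightTexel {
    float r = 0.0f, g = 0.0f, b = 0.0f, a = 0.0f;
};

// Additively accumulate one light into the texel.
// n = surface normal, l = direction to the light, h = half vector (all unit
// length); the formulas are an assumed, commonly used variant.
void accumulateLight(LightTexel& t, Vec3 n, Vec3 l, Vec3 h,
                     Vec3 lightColor, float attenuation, float specPower) {
    const float ndotl = std::max(dot(n, l), 0.0f);
    const float ndoth = std::max(dot(n, h), 0.0f);
    const float diffuse = ndotl * attenuation;
    t.r += lightColor.x * diffuse;
    t.g += lightColor.y * diffuse;
    t.b += lightColor.z * diffuse;
    t.a += std::pow(ndoth, specPower) * diffuse;
}
```

Because everything fits in four channels, the later forward pass needs only one fetch from the light buffer per pixel (or per sample).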

Wolfgang Engel said...

Regarding LogLuv, or L16uv, which I prefer: you don't need to use it on the light buffer. I would use it only on the main rendering path.
You might check out Pat Wilson's ShaderX7 article that describes a solution that stores LUV in the light buffer. He only uses four channels.
The way it works is that L is represented by N.L, the next two channels store u and v, and specular is stored in the fourth channel.