The following figure from the ShaderX7 article shows how MSAA works:
The pixel represented by a square has two triangles (blue and yellow) crossing some of its sample points. The black dot represents the pixel sample location (pixel center); this is were the pixel shader is executed. The cross symbol corresponds to the location of the multisamples where the depth tests are performed. Samples passing the depth test receive the output of the pixel shader. Those samples are replicated by the MSAA back-end into a multisampled render target that represents each pixel with -in that case- four samples. That means the render target size for an intended resolution of 1280x720 would be 2560x1440 representing each pixel with four samples but the pixel shader only writes 1280x720 times (assuming there is no overdraw) while the MSAA back-end replicates for each pixel four samples into the multisampled render target.
With deferred lighting there can be several of those multi-sampled render targets as part of a Multiple-Render-Target (MRT). In the so called Geometry stage, data is written into this MRT; therefore called G-Buffer. In case of 4xMSAA each of the render targets of the G-Buffer would be 2560x1440 in size.
In case of Deferred Lighting / Light Pre-Pass the G-Buffer holds normal and depth data. This data can never be resolved because resolving it would lead to incorrect results as shown by Nicolas in his article.
After the Geometry phase comes the Lighting or Shading phase in a Deferred Lighting/Light Pre-Pass/Deferred Shading renderer. In an ideal world you could blit each sample (not pixel) into the multisampled render target -that holds the result of the Shading phase- by reading the G-Buffer sample and performing all the calculations necessary on it.
In other words to achieve the best possible MSAA quality with those renderer designs, lighting equations would need to be applied on a per-sample basis into a multisampled render target and then later resolved.
This is possible with DirectX 10.1 graphics hardware (AMD's 10.1 capable cards; didn't try if S3 cards that support 10.1 can do this as well) that allows to execute a pixel shader at sample frequency.
To make this a viable option, this operation needs to be restricted to samples that belong to pixel edges. There are two passes necessary to make this work. One pass will use the pixel shader that runs operations performed on samples and in a second pass the pixel shader is run that performs operations per-pixel, which means the result of the pixel shader calculation is output to all samples passing the depth-stencil test.
To restrict the pixel shader that performs operations per-sample, a stencil test is used.
One interesting idea covered in the article is to detect edges with centroid sampling (available already on DirectX9 class graphics hardware). During the G-Buffer phase the vertex shader writes a variable unique to every pixel (e.g. pixel position data) into two outputs, while the associated pixel shader declares two inputs: one without and one with centroid sampling enabled. The pixel shader then compares the centroid-enabled input with the one without it. Differing values mean that samples were only partially covered by the triangle, indicating an edge pixel. A "centroid value" of 1.0 is then written out to a selected area of the G-Buffer (previously cleared to 0.0) to indicate that the covered samples belong to an edge pixel. Those values are then averaged while being resolved to find out the value per pixel. If the result is not exactly 0, then the current pixel is an edge pixel. This is shown in the following image from the article.
On the left the pixel shader input will always be evaluated at the center of the pixel regardless of whether it is covered by the triangle. On the right with centroid sampling, the two rightmost depth samples are covered by the triangle. The comparison of the values in the pixel shader will lead to the result that the samples were only partially covered by the triangle, indicating an edge pixel.
Because DirectX10 capable graphics hardware does not support the pixel shader running at sample frequency, a different solution needs to be developed here.
The best MSAA quality in that case is achieved by running the pixel shader multiple times per pixel, only enabling output to a single sample each pass. This can be achieved by using the OMSetBlendState() API. The results of this method would be identical to the DirectX 10.1 method but obviously due to the increased number of rendering passes and slightly reduced texture cache effectiveness more expensive.