One of the first things I wanted to try on the GF100 was the new NVIDIA extensions that allow random read/write access and atomic operations on global memory and textures, in order to implement a fast A-Buffer!
It works pretty well: it provides roughly a 1.5x speedup over the fastest previous approach (at least that I know about!), with zero artifacts, and it supports an arbitrary number of layers per pixel with a single geometry pass.
Sample application sources and Win32 executable:
Sources+executable+Stanford Dragon model
Additional models
Be aware that this will probably only run on a Fermi card (Forceware drivers >=257.15). In particular it requires: EXT_shader_image_load_store, NV_shader_buffer_load, NV_shader_buffer_store, EXT_direct_state_access.
The application uses freeglut to initialize an OpenGL 4.0 context with the core profile.
Keys:
- 'a' Enable/Disable A-Buffer
- 's' Enable/Disable fragment sorting. When disabled, the closest fragment is kept during resolve.
- 'g' Switch between Alpha-Blending and Gelly resolve modes.
- 'c' Enable/Disable alpha correction when in Alpha-Blending mode.
- 't' Switch between using textures or global memory for A-Buffer storage.
- '1'-'3' Change mesh (requires the additional models).
A-Buffer:
Basically, an A-Buffer is a simple list of fragments per pixel [Carpenter 1984]. Previous methods to implement it on DX10-generation hardware required multiple passes to capture a useful number of fragments per pixel. They were essentially based on depth peeling, with enhancements that allow capturing more than one layer per geometric pass, such as the k-buffer or the stencil-routed k-buffer. Bucket-sort depth peeling can capture up to 32 fragments per geometry pass, but with only 32 bits per fragment (just a depth) and at the cost of potential collisions. All these techniques were complex and, above all, limited by the maximum of 8 render targets writable by the fragment shader.
This technique can handle an arbitrary number of fragments per pixel in a single pass, the only limitation being the available video memory. In this example, I implement order-independent transparency, with each fragment stored as a 4x32-bit value containing the RGB color components and the depth.
Technique:
The idea is very simple: each fragment is written by the fragment shader at its position into a pre-allocated 2D texture array (or a global memory region) with a fixed maximum number of layers. The layer to write the fragment into is given by a per-pixel counter, stored in another 2D texture and incremented with an atomic increment (or addition) operation (imageAtomicIncWrap or imageAtomicAdd). After the rendering pass, the A-Buffer contains an unordered list of fragments per pixel, along with its size. To sort these fragments by depth and composite them on screen, I simply draw a single screen-filling quad with a fragment shader. This shader copies all the pixel's fragments into a local array (probably stored in L1 on Fermi), sorts them with a naive bubble sort, and then combines them front-to-back based on transparency.
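As a minimal illustration of the capture and resolve passes described above, here is a hypothetical CPU sketch; the struct and function names (Fragment, ABufferPixel, resolvePixel) are mine, not from the sample code. Fragments are appended to a per-pixel list (the GPU version uses a fixed-size layer array plus an atomic per-pixel counter), then bubble-sorted by depth and composited front-to-back:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One captured fragment: color, opacity, and eye-space depth (smaller = closer).
struct Fragment {
    float r, g, b, a;
    float depth;
};

struct ABufferPixel {
    std::vector<Fragment> frags;  // GPU version: fixed-size layer array
    // On the GPU, the count lives in a 2D uint texture and is bumped with
    // imageAtomicAdd/imageAtomicIncWrap; here a plain push_back stands in.
    void insert(const Fragment& f) { frags.push_back(f); }
};

// Resolve: sort the pixel's fragments by depth with a naive bubble sort
// (the lists are short), then composite front-to-back.
inline Fragment resolvePixel(ABufferPixel px) {
    auto& v = px.frags;
    for (std::size_t i = 0; i + 1 < v.size(); ++i)
        for (std::size_t j = 0; j + 1 < v.size() - i; ++j)
            if (v[j].depth > v[j + 1].depth) std::swap(v[j], v[j + 1]);

    Fragment out{0, 0, 0, 0, 0};
    float transmittance = 1.0f;  // how much light still passes through
    for (const Fragment& f : v) {
        out.r += transmittance * f.a * f.r;
        out.g += transmittance * f.a * f.g;
        out.b += transmittance * f.a * f.b;
        transmittance *= (1.0f - f.a);
    }
    out.a = 1.0f - transmittance;
    return out;
}
```

The front-to-back order lets the real shader stop early once the transmittance is close to zero, which is one reason sorting pays off.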
Performances:
To compare performance, this sample also features a standard rasterization mode that renders directly into the color buffer. On the Stanford Dragon example, with a GTX 480 and 32 layers in the A-Buffer, the technique runs at 400-500 FPS and is only 5-20% more costly than a simple rasterization of the mesh.
I also compared performance with the k-buffer, whose code is available online (be careful though, it may not be fully optimized). On the GTX 480, with the same model and shading (and 16 layers), I get more than a 2x speedup. Based on these results, I strongly believe it is also close to 1.5x faster than bucket-sort depth peeling, without its depth-collision problems.
EDIT: The artifacts in the stencil-routed k-buffer came from a bug in DXUT; images removed. Also added a warning about the performance of the k-buffer OpenGL code from Louis Bavoil's page.
EDIT 2: The follow-up to this work, using per-pixel linked lists, can be read here: http://blog.icare3d.org/2010/07/opengl-40-abuffer-v20-linked-lists-of.html
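The per-pixel linked-list variant mentioned in EDIT 2 replaces the fixed layer array with a single shared fragment pool and one head pointer per pixel, so memory is only consumed where fragments actually land. A hypothetical CPU sketch (names are illustrative; on the GPU the pool index comes from a global atomic counter and the head pointer is swapped with an atomic exchange):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One pooled fragment plus the index of the next fragment for the same pixel.
struct LinkedFragment {
    float depth;
    std::uint32_t rgba;  // packed color
    std::int32_t next;   // index into the pool, -1 = end of list
};

struct LinkedListABuffer {
    std::vector<LinkedFragment> pool;  // shared pool, grows as fragments arrive
    std::vector<std::int32_t> head;    // one head index per pixel, -1 = empty

    explicit LinkedListABuffer(std::size_t pixels) : head(pixels, -1) {}

    // GPU equivalent (roughly): idx = atomicCounterIncrement(...);
    //                           old = imageAtomicExchange(headTex, coord, idx);
    void insert(std::size_t pixel, float depth, std::uint32_t rgba) {
        auto idx = static_cast<std::int32_t>(pool.size());
        pool.push_back({depth, rgba, head[pixel]});  // link to previous head
        head[pixel] = idx;                           // new fragment becomes head
    }

    // Walk one pixel's chain and return its fragment count.
    int count(std::size_t pixel) const {
        int n = 0;
        for (std::int32_t i = head[pixel]; i != -1; i = pool[i].next) ++n;
        return n;
    }
};
```

The trade-off versus the fixed-layer A-Buffer above: no wasted per-pixel storage and no hard layer cap, at the cost of a pointer per fragment and scattered memory accesses during resolve.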
June 9, 2010 at 7:47 PM
Niceee!!
This is actually probably more an OpenGL 4.1 method (or beyond)!
It doesn't work with AMD drivers like you expected.
I might be missing something, but you have to allocate a big buffer at the beginning for the written fragments? A bit like the AMD order-independent transparency, but with a 2D texture array instead of an RWBuffer?
One thing that feels not so practical is the amount of memory required for the A-Buffer, even in areas where no fragments are actually written. Well, at least it doesn't consume any bandwidth...
There are solutions explored back in the old days where no extra memory and no extra bandwidth were required, but they never made it into NVIDIA and AMD chips... I don't really understand why.
June 9, 2010 at 8:25 PM
Hi Christophe !
You are right, you have to allocate a big buffer at the beginning with the maximum number of layers for every pixel; for instance, at 1024x768 with 32 layers of 16-byte fragments, that is already 1024x768x32x16 bytes = 384 MB. That's the main problem with this technique, but it was also the case with previous techniques (and as you said, at least it does not consume any bandwidth).
Can you elaborate a little more on the old-days solutions you are talking about?
June 9, 2010 at 9:59 PM
Does your technique store a fixed amount of fragments per pixel? Or is it flexible and it allows you to store a variable number of fragments per pixel? (similarly to what AMD does in their OIT demo..)
June 9, 2010 at 10:06 PM
Yes, there is a fixed maximum number of fragments per pixel, but the actual number is maintained dynamically (that's the per-pixel counter, updated each time a fragment is written).
June 10, 2010 at 12:37 PM
Erm... Actually, the AMD method doesn't have a fixed number of fragments per pixel.
That's where their solution is slightly more interesting: a single pixel can contain 100 fragments even if the average is, say, 16 fragments per pixel. It means we expect some pixels to require fewer than 16 samples, which is likely to happen. I also suspect that the AMD method is faster, but this is difficult to actually check.
Well, this is still good work, and I am looking forward to the further results of your research! :p
June 10, 2010 at 12:42 PM
I did not know that technique; do you have a reference to a paper or a presentation?
Thanks !
June 10, 2010 at 4:44 PM
That's very nice work - keep it up!
It looks like this is something that can be extended to implement translucent shadow maps? During the depth pass, things are rendered exactly as you've described here (except that the final merging pass is skipped). During the render pass, a Z-search is done on the A-buffer to find where in the list the view-space pixel lies. If there are samples between this pixel and the light (i.e. it is shadowed by one or more layers), then all the samples are alpha-blended to give the final colour.
Not sure if the above makes any sense.. :-)
June 10, 2010 at 7:02 PM
The AMD method uses per-pixel linked lists (and it's very fast); they have a paper at EGSR that explains their method and a few applications.
Translucent shadow maps are possible (I have implemented a few different solutions) but generally very slow to sample. There are different and imho better ways to approach that problem and we have a paper at EGSR to show an alternative approach.
See the "Shadow and Order Independent Transparency" session:
http://kesen.realtimerendering.com/egsr2010Papers.htm
June 10, 2010 at 9:18 PM
Awesome work Cyril. Always nice to see someone using OpenGL 4.0 with the core profile. :) Sure, there are plenty of other OIT algorithms out there, but A-Buffer is nice and simple. I love it!
June 10, 2010 at 9:43 PM
The AMD method uses a per pixel linked list and is described in a GDC 2010 presentation. I assume the EGSR presentation will have more examples/uses.
http://developer.amd.com/gpu_assets/OIT%20and%20Indirect%20Illumination%20using%20DX11%20Linked%20Lists_forweb.ppsx
June 11, 2010 at 11:15 AM
@Rex Guo: It does make sense. I think it is an interesting application of A-Buffer approaches.
@pixelstoomany : Thanks for the link !
@id: Thanks :-)
@Todd: Thanks also for the link, I will read this !
June 14, 2010 at 8:29 PM
Is there a list of cards this will run on? My 470 won't run the demo.
June 14, 2010 at 10:22 PM
Hi Ben, a 470 should run it, do you get an error message on the console ? What driver are you using ?
June 15, 2010 at 3:37 PM
freeglut (ABufferGL4.exe): Unable to create OpenGL 4.0 context (flags 2, profile 1)
drivers: 197.75
June 15, 2010 at 3:50 PM
I think you need an R256 driver to get OpenGL 4.0 support. You should try the new 257.21.
June 27, 2010 at 8:43 AM
I compiled it on GCC4/Linux. Some notes:
- in Matrix4.h, "static Mat4 reflection(const Vector4 &plane)" does not compile because "v" is undeclared
- I had to include <cstdio> and <cstring> for fprintf, stderr and strcmp to be declared in ShadersManagment.cpp
- it should be "Management" and not "Managment" ;)
Great technique BTW, too bad I only have a GeForce 8600M GT :/
When running glewinfo, I get "OK [MISSING]" for the extensions you use, which means the functions are here but not in GL_EXTENSIONS.
Do you know if these extensions could work anyway on my hardware, or if they are just present because the driver is the same for Fermi and G80 hardware?
June 19, 2012 at 11:38 AM
Hi Cyril, nice work. I compiled your source code on Win7 64-bit successfully, but it can't run.
Here is the error message
[display]GL error invalid operation
My card is a GTX 560 and the driver is 301.42. I updated glew and freeglut to 64-bit, but it doesn't work.
Thanks.
June 21, 2012 at 4:20 AM
I have solved this problem: just add a VAO in the "display" function. Thanks.
August 4, 2012 at 7:05 PM
I get "[glewInit] GL error invalid enumerant" on NVIDIA GTX 580M.
September 21, 2012 at 4:29 AM
@qwcbeyond can you please clarify your solution?
November 2, 2012 at 11:57 PM
Hi Cyril,
I'm trying to implement your fragment shader critical sections but I'm running into some problems. I hope you can offer some advice on how to debug them.
I've implemented my semaphores just like you do, with a 2D texture of uints. The critical sections seem to work fine for simply counting per-pixel fragments. Specifically, in each critical section I can do something like:
imageStore(coords, pixelFragCounts, ivec4(curFragIdx));
Just to be clear: this works fine.
However, if I do any atomic operations (like incrementing a global counter, or some atomic operation on an arbitrary texture) inside the critical section, then my application behaves inconsistently. Specifically, when I visualize the per-pixel fragment counts I see flickering pixels, so the per-pixel fragment count is incorrect (it is always less than the correct value). This seems to imply the imageStore is not behaving as expected. But the imageStore should be completely independent of any other atomic operation, so I'm stumped.
Did you run into any problems like this with your implementation? Can you offer any debugging advice?
Thanks,
Ethan
February 12, 2013 at 12:31 AM
Hi. I tried running the executable on an Intel HD4000 card with a subsequent crash. "In theory" this card can do OpenGL 4.0. :P. Well, hopefully it will help. Here's the program output before the crash.
http://pastebin.com/c2qPb8th
Thank you for your time!
May 30, 2013 at 11:04 AM
I fixed the error, which was caused by issuing a VAO draw while no VAO exists. Here is the link to the fixed source file: https://dl.dropboxusercontent.com/u/22111229/ABufferGL4.rar