The main problem with my first ABuffer implementation (cf. my previous post) was that a fixed maximum number of fragments per pixel has to be allocated at initialization time. With this approach, the size of the ABuffer can quickly become very large when the screen resolution and depth complexity of the scene increase.
Using linked lists of fragment pages per pixel
Original basic approach
To address this problem, I implemented a variant of the recent OIT method presented by AMD at GDC 2010, which uses per-pixel linked lists. The main difference in my implementation is that fragments are not stored and linked individually, but in small pages of fragments (each containing 4-6 fragments). These pages are stored and allocated in a shared pool whose size changes dynamically with the demands of the scene.
Using pages increases cache coherency when accessing the fragments, improves the efficiency of concurrent access to the shared pool, and decreases the storage cost of the links. This comes at the cost of a slight over-allocation of fragments.
The shared pool is composed of a fragment buffer, where fragment data is stored, and a link buffer storing the links between pages, which are reverse-chained. Each pixel of the screen stores the index of the last page it references, as well as a counter with the total number of fragments stored for that pixel (incremented using atomics).
Access to the shared pool is managed through a global page counter, incremented with an atomic operation each time a fragment needs a new page. A fragment allocates a page when it detects that the current page is full, or that its pixel does not have any page yet. This is done inside a critical section to ensure that multiple fragments in flight in the pipeline that fall into the same pixel are handled correctly.
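To make this more concrete, here is a minimal sketch of the per-fragment insertion in GLSL, written against core GL 4.2 image load/store and atomic counters rather than the NV extensions the demo actually uses; the names, the lock encoding and the page size below are only illustrative assumptions, not the demo's code.

#version 420
#define PAGE_SIZE 4

layout(rgba32f) coherent uniform imageBuffer  fragmentBuffer; // shared pool: PAGE_SIZE fragments per page
layout(r32i)    coherent uniform iimageBuffer linkBuffer;     // per page: index of the previous page (reverse chaining)
layout(r32i)    coherent uniform iimage2D     headPageImg;    // per pixel: index of the last allocated page (-1 if none)
layout(r32i)    coherent uniform iimage2D     fragCountImg;   // per pixel: total number of fragments stored
layout(r32ui)   coherent uniform uimage2D     lockImg;        // per pixel: lock guarding the critical section
layout(binding = 0) uniform atomic_uint pageCounter;          // global page counter into the shared pool

in vec4 fragmentData;   // assumed fragment payload (color + depth packed by the caller)

void main(void) {
    ivec2 coord = ivec2(gl_FragCoord.xy);
    bool done = false;

    // Loop until this fragment manages to enter the per-pixel critical section.
    while (!done) {
        if (imageAtomicExchange(lockImg, coord, 1u) == 0u) {   // 0 = free, 1 = taken
            int count = imageLoad(fragCountImg, coord).x;      // fragments already stored in this pixel
            int page  = imageLoad(headPageImg,  coord).x;      // last page of the pixel's list

            // Allocate a new page when there is none yet or the current one is full.
            if (page < 0 || (count % PAGE_SIZE) == 0) {
                int newPage = int(atomicCounterIncrement(pageCounter));
                imageStore(linkBuffer, newPage, ivec4(page));  // reverse link to the previous page
                imageStore(headPageImg, coord, ivec4(newPage));
                page = newPage;
            }

            // Store the fragment in its slot inside the current page and bump the pixel counter.
            imageStore(fragmentBuffer, page * PAGE_SIZE + (count % PAGE_SIZE), fragmentData);
            imageStore(fragCountImg, coord, ivec4(count + 1));

            // memoryBarrier(); // needed in principle before the unlock, see the memoryBarrier() discussion in the comments
            imageAtomicExchange(lockImg, coord, 0u);           // leave the critical section
            done = true;
        }
    }
}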
ABuffer memory occupancy differences:
Some examples of the memory occupancy of fragment storage at different screen resolutions (basic approach vs. linked lists):
- 512x512: 64MB vs 6.5MB
- 768x708: 132.7MB vs 11.7MB
- 1680x988: 405MB vs 27.42MB
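As a sanity check on the basic-approach figures (the 16 fragments per pixel and the 16-byte RGBA32F fragment format are assumptions carried over from the first implementation, not stated above), they correspond to a fixed budget of 256 bytes per pixel:
512 x 512 pixels x 16 fragments x 16 bytes = 67,108,864 bytes = 64MB
and likewise for the two larger resolutions.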
The cost of this huge reduction in storage is lower rendering speed compared to the basic approach. Linked lists can run at as little as half the speed of the basic approach when the additional per-fragment costs are low, due to the extra memory accesses and the increased complexity of the fragment shader (more code, more registers). But this overhead seems well amortized when the per-fragment shading cost increases.
Order Independent Transparency (OIT) demo application & source code
New keys:
- 'x' : Switch between ABuffer Algorithms (V1.0 Basic and V2.0 Linked List)
- 'n' : Display the number of fragments per pixel.
- 'g' : Switch between Alpha-Blending and Gelly resolve modes.
UPDATE 28/10/2010: Oscarbg did a port of the demo so that it can run on AMD (mainly removing everything related to shader_load/store), more info here:
http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=285258#Post285258
But sadly it still does not work on AMD, so if someone from AMD reads this, your help is welcome!
I can't try it myself since I don't have any AMD card :-(
July 19, 2010 at 11:29 PM
Very interesting but unfortunately, I can't run the demo...
Still, a few questions:
- What is the effect of the page size on performance?
- Are depths stored in RGBA components for the case of a 4-size page?
- Also, what about using square pages? They may result in better performance because of 2D texture cache locality (or maybe not, since pages are indeed small).
July 20, 2010 at 1:50 AM
Hi, thanks for your comment.
Is there a problem with the code that prevents you from running the demo?
About your questions:
- The larger the page, the better the performance; however, starting from 8 fragments per page, I did not notice any further performance gain from bigger pages.
- Yes, I store the z into the fourth component of the fragments.
- In fact the code uses either a texture buffer or a simple global memory buffer object to store the pages, and both go through a linear cache (the L1/L2 cache hierarchy or the texture cache). So using a 2D texture is not really needed.
Cheers
July 20, 2010 at 9:03 AM
Thank you for your answers!
And nothing wrong with your demo, it's just that I only have a GF275 at home. :/
July 20, 2010 at 8:21 PM
This reminds me a bit of GigaVoxels, or at least the pool part of it. This is the passage that made me think of it: "These pages are stored and allocated in a shared pool whose size changes dynamically with the demands of the scene."
That is somewhat the case, or maybe I'm completely off the mark :D
By the way, that makes me think: is the GigaVoxels source code available somewhere, or not?
I find it quite an interesting technique.
July 22, 2010 at 1:14 AM
Awesome demo..
Some questions:
* I remember the AMD OIT technique saying it requires a special trick for MSAA support. Is your technique amenable to such a trick, and if yes, does your current demo support MSAA in linked-list mode?
What about MSAA in the plain ABuffer demo?
It seems that if you supported it, you would have to use the sample_shading extension and gl_SampleMask, and you aren't using them, so no MSAA support, right?
* Also, I remember seeing in the presentation that the storing fragment shader required [earlydepthstencil], for both correctness and improved performance if you mix in non-transparent objects, and the equivalent in the OpenGL 4.0 load_store extension is:
layout(early_fragment_tests) in;
I see you don't use it, so it seems you could improve performance in a more general environment by using that, and also ensure correctness I think.
It would be good if you could add some non-transparent objects to the mix to test that.
* What's the performance gain (say, using the default dragon model) between your method and the AMD OIT linked lists of individual fragments? Also, how much higher is the memory requirement when using pages of 4-6 fragments?
Also, is your method with 1 fragment per page equivalent to the AMD demo, or am I not reading deeply enough?
* I have seen in the code:
//memoryBarrier(); //We theoretically need it here, but not implemented in current drivers !
So the GLSL compiler fails on it, correct? Also, theoretically, isn't the fragment handling not fully correct then?
Many thanks..
July 22, 2010 at 12:25 PM
Hi Oscar,
That's a lot of interesting questions :-)
* About MSAA support, I think the AMD demo is managing it by simply adding an additional coverage field per fragment, which is filled thanks to the DX input coverage info. The exact same trick can be applied here in OpenGL using the GL4 gl_SampleMaskIn variable. They also provide a "resolve" (they call it "rendering") phase that outputs a multisampled buffer by forcing the resolve FS to run on each sample instead of once per pixel. That's also something that can be done, as you said, using glMinSampleShading(...) (which is now in the GL4 core). I did not implement it but it does not seem difficult :-)
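To sketch the storage side of it (just an illustration assuming GL 4.0, not code from the demo; the resolve-side test is only described in the comments):

#version 400
out vec4 outColor;

void main(void) {
    // Storage pass: gl_SampleMaskIn[0] gives the samples of the pixel covered by this
    // fragment, the GL equivalent of the DX11 input coverage AMD stores per fragment.
    int coverage = gl_SampleMaskIn[0];

    // A real implementation would store 'coverage' next to the fragment color and depth in
    // the per-pixel list; it is written to a dummy output here only to keep the sketch compiling.
    outColor = vec4(float(coverage));

    // Resolve pass: with glMinSampleShading(1.0f) the resolve shader runs once per sample,
    // and a stored fragment contributes to the current sample only when
    // (storedCoverage & (1 << gl_SampleID)) != 0.
}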
* About early depth-stencil, as you said it's interesting when you combine with opaque objects, in order to prevent occluded fragments from being generated and stored, and the GL4 layout specifier allows forcing early depth-stencil. You are right, I should extend the demo with an intersecting opaque object. We'll see if I have time :-)
* I was not able to compare directly with the AMD demo, since the code does not seem to be available (is it?). You are right, their approach is theoretically equivalent to mine with PAGE_SIZE==1. Another difference is that they rely on the UAV's global counter (which is not exposed in OpenGL) to insert fragments into their shared pool, while I rely on a standard atomic operation. I don't know if it brings a real gain. In my application at least, I get a 30% speed difference between PAGE_SIZE==1 and PAGE_SIZE==4 (but I still have a little bit of extra code per fragment to manage the pages).
* Yep, the GLSL compiler fails on memoryBarrier(); it will be fixed in upcoming drivers. That means the mutex can theoretically be released before the previous writes are really visible to other threads. In practice it does not seem to happen.
Hope it helps :-)
July 23, 2010 at 12:33 AM
I assume you mean PAGE_SIZE==4 was 30% faster than 1. That surprises me as I'd have thought it wouldn't matter. With size 1 all writes to the fragment-link buffer are sequential so they should be efficient. Reads would be less efficient though so I can see the resolve step being slower. I didn't look at the code so maybe I'm missing something.
Have you analyzed the cost of the buffer creation compared to the resolve?
October 25, 2010 at 4:16 PM
Hi,
I think you should rather optimize for the buffer creation. I suppose the reason why you had better performance with PAGE_SIZE==4 is that you also use R/W buffers in the resolve shader.
At least on AMD, I know that binding the texture buffer as an R/W buffer means the "complete path" is used, which allows writes and atomics, rather than the "fast path" used for simple input texture buffers. The throughput difference between the two can be more than 1:10.
So I suggest trying not to bind the data as an R/W buffer in the resolve shader; maybe you will get some speedup, and maybe in that case PAGE_SIZE==1 will be faster.
October 28, 2010 at 6:57 PM
Hi, I have fixed it for AMD compatibility;
please see and comment:
http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=285258#Post285258
October 28, 2010 at 7:07 PM
Great work Oscar, thanks a lot!
I wanted to try it on AMD when I saw they finally support image_load_store, but I have not had time yet.
Hope someone at AMD will help with this!
Still, I suspect atomic ops to be faster on NVIDIA.
Something else: what happened to your blog? Have you been hired by NVIDIA/AMD/INTEL?? ;-)
February 1, 2011 at 9:34 AM
Hi,
I was trying to run your demo on the NVIDIA R265 drivers, but it fails because the NV_shader_buffer_store extension no longer seems to be supported. The EXT_shader_image_load_store-converted demo from oscarbg crashes when trying to switch the A-Buffer mode ('x'), with a lot of compile errors in the renderABufferFrag.glsl file.
Regards
-chris
February 7, 2011 at 2:56 PM
Hi,
thanks for the interesting demo,
what do you suggest using, a global memory buffer or textures?
Thanks
February 19, 2011 at 4:41 PM
Hi, in my tests it appears that global memory buffers are slightly faster than textures on Fermi. I guess it's due to a higher latency of read operations when going through the texture samplers.
October 3, 2011 at 5:23 PM
Hi Cyril, I would really like to try out your demo, but unfortunately neither the linked-list version nor the texture-array version works on the 500M chip series (which is OpenGL 4.1).
So basically, to run the demo we need an OpenGL 4.2 capable graphics card? (meaning all laptops are excluded)
June 25, 2012 at 4:30 AM
Hi Cyril, is there any way to use the linked lists on an NVIDIA graphics card?
June 25, 2012 at 4:58 AM
Of course this code was developed on NVIDIA :)
July 4, 2012 at 8:03 AM
Hi Cyril, unfortunately I also get this error: "NV_shader_buffer_store not support". I have updated the driver for my GTX 560, but it didn't work.
July 5, 2012 at 4:49 AM
I found that the compilation error is because of the word "inline" used in your GLSL files. When I remove "inline", it compiles successfully.
July 5, 2012 at 7:38 AM
Right, thanks. It has been a long time since I needed to touch this; sorry, it was compiling before on NVIDIA, but "inline" has never been standard GLSL.
August 29, 2013 at 5:30 PM
I'm in the process of porting this example to use a new OpenGL-based rendering library I've been developing. Can I have permission to distribute my modified version of your example as part of this project? I would leave your original copyright comment blocks intact. (The project is open source: jag-3d.googlecode.com.)
August 29, 2013 at 6:50 PM
Here's a bugfix for rendering the full window triangle pair.
diff -r 02266942dc79 -r 911caea8f207 abuffer/ABufferGL4.cpp
--- a/abuffer/ABufferGL4.cpp Wed Aug 28 15:28:17 2013 -0600
+++ b/abuffer/ABufferGL4.cpp Thu Aug 29 10:46:29 2013 -0600
@@ -278,7 +278,7 @@
glVertexAttribPointer (glGetAttribLocation(prog, "vertexPos"), 4, GL_FLOAT, GL_FALSE,
sizeof(GLfloat)*4, 0);
- glDrawArrays(GL_TRIANGLES, 0, 24);
+ glDrawArrays(GL_TRIANGLES, 0, 6);
//checkGLError ("drawQuad");
}
August 14, 2015 at 10:40 AM
Do you actually allocate pages from within GLSL, or do you simply take them as needed from a host-allocated buffer? If it's the former, what extension allows you to do that?