The main problem with my first ABuffer implementation (cf. my previous post) is that a fixed maximum number of fragments per pixel has to be allocated at initialization time. With this approach, the ABuffer can quickly become very large as the screen resolution and the depth complexity of the scene increase.


Using linked lists of fragment pages per pixel

(Figure: original basic approach)

To address this problem, I implemented a variant of the recent OIT method presented by AMD at GDC 2010, which uses per-pixel linked lists. The main difference in my implementation is that fragments are not stored and linked individually, but grouped into small pages of fragments (containing 4-6 fragments each). These pages are stored and allocated in a shared pool whose size is adjusted dynamically to the demands of the scene.
Using pages increases cache coherency when accessing the fragments, improves the efficiency of concurrent accesses to the shared pool, and decreases the storage cost of the links, at the price of a slight over-allocation of fragments.
The shared pool is composed of a fragment buffer, where fragment data is stored, and a link buffer storing the links between the pages, which are reverse-chained. Each pixel of the screen keeps the index of the last page it references, as well as a counter with the total number of fragments stored for that pixel (incremented using atomic operations).
Access to the shared pool is managed through a global page counter, incremented with an atomic operation each time a fragment needs a new page. A fragment allocates a page when it detects that the current page is full, or when there is no page yet for the pixel. This is done inside a critical section to ensure that multiple fragments in flight in the pipeline that fall into the same pixel are handled correctly.
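
To make this more concrete, here is a minimal CPU-side sketch in C++ of the page pool and of the per-pixel insertion path. It is only an illustration of the idea, not the demo's code: std::atomic and a per-pixel spinlock stand in for the GPU atomics and the fragment-shader critical section, and all names and constants (PAGE_SIZE, SharedPool, insertFragment, the pool size in main) are assumptions of the sketch.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Illustrative constants, not the demo's actual values.
static const uint32_t PAGE_SIZE    = 4;            // fragments per page (the post uses 4-6)
static const uint32_t INVALID_PAGE = 0xFFFFFFFFu;  // "no page yet" marker

struct Fragment { float depth; float rgba[4]; };

// Shared pool: fragment storage plus one reverse link per page.
struct SharedPool {
    std::vector<Fragment> fragments;      // PAGE_SIZE fragments per page
    std::vector<uint32_t> pageLinks;      // per page: index of the previous page of the same pixel
    std::atomic<uint32_t> pageCounter{0}; // global page counter used to allocate pages
};

// Per-pixel state: last referenced page, total fragment count, and a lock
// standing in for the fragment-shader critical section.
struct PixelRecord {
    uint32_t lastPage  = INVALID_PAGE;
    uint32_t fragCount = 0;
    std::atomic<bool> lock{false};
};

// Insert one fragment for a pixel: allocate a new page from the pool when
// the pixel has no page yet or its current page is full, then store the
// fragment in the last page and bump the per-pixel counter.
void insertFragment(SharedPool& pool, PixelRecord& pixel, const Fragment& f) {
    while (pixel.lock.exchange(true, std::memory_order_acquire)) {}  // enter critical section

    if (pixel.lastPage == INVALID_PAGE || pixel.fragCount % PAGE_SIZE == 0) {
        // Grab a fresh page; a real implementation would grow the pool
        // when this counter reaches the pool size.
        uint32_t newPage = pool.pageCounter.fetch_add(1);
        pool.pageLinks[newPage] = pixel.lastPage;  // reverse-chain to the previous page
        pixel.lastPage = newPage;
    }

    pool.fragments[pixel.lastPage * PAGE_SIZE + pixel.fragCount % PAGE_SIZE] = f;
    pixel.fragCount += 1;

    pixel.lock.store(false, std::memory_order_release);  // leave critical section
}

int main() {
    SharedPool pool;
    pool.fragments.resize(1024 * PAGE_SIZE);    // pre-allocated shared pool (1024 pages)
    pool.pageLinks.resize(1024, INVALID_PAGE);

    PixelRecord pixel;
    for (int i = 0; i < 10; ++i)
        insertFragment(pool, pixel, Fragment{0.1f * i, {1.0f, 0.0f, 0.0f, 0.5f}});
    return 0;
}
```

In the actual demo this logic lives in the fragment shader (using the shader_load/store functionality mentioned in the update below); the sketch only mirrors its control flow.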



ABuffer memory occupancy differences:


Some examples of the memory occupancy of fragment storage at different screen resolutions (Basic vs. Linked Lists):
  • 512x512:    64MB vs 6.5MB 
  • 768x708:   132.7MB vs 11.7MB
  • 1680x988:  405MB vs 27.42MB
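
For reference, the basic-approach figures above are consistent with a fixed budget of 16 fragments per pixel, each stored as one 16-byte RGBA value (256 bytes per pixel); those constants are my assumption of the V1.0 defaults, and the linked-list figures depend on the actual depth complexity of the scene rather than on resolution alone. A quick back-of-the-envelope check:

```cpp
#include <cstdio>

// Back-of-the-envelope check of the basic ABuffer footprint, assuming
// 16 fragments per pixel at 16 bytes (one RGBA vec4) each -- assumed
// V1.0 defaults, not values read from the demo code.
int main() {
    const int resolutions[][2] = {{512, 512}, {768, 708}, {1680, 988}};
    const double bytesPerPixel = 16.0 * 16.0;  // maxFragmentsPerPixel * sizeof(vec4)

    for (const auto& res : resolutions) {
        double mb = res[0] * res[1] * bytesPerPixel / (1024.0 * 1024.0);
        std::printf("%dx%d -> %.2f MB\n", res[0], res[1], mb);
    }
    return 0;  // prints 64.00, 132.75 and 405.23 MB
}
```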

The price of this huge reduction in storage is a drop in rendering speed compared to the basic approach. Linked lists can run at down to half the speed of the basic approach when the per-fragment shading cost is low, due to the additional memory accesses and the increased complexity of the fragment shader (more code, more registers). But this cost seems well amortized when the per-fragment shading cost increases.

Order Independent Transparency (OIT) demo application & source code
New keys:
  • 'x' : Switch between ABuffer Algorithms (V1.0 Basic and V2.0 Linked List)
  • 'n' : Display the number of fragments per pixel.
  • 'g' : Switch between Alpha-Blending and Gelly resolve modes.

UPDATE 28/10/2010: Oscarbg made a port of the demo so that it can run on AMD hardware (mainly by removing everything related to shader_load/store); more info here:
http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=285258#Post285258
But sadly it still does not work on AMD, so if somebody at AMD reads this, your help is welcome!
I can't try it myself since I don't have an AMD card :-(