Interview on 3DVF
Ph.D. defended and postdoc at NVIDIA Research
Real-Time Shadows Book
Siggraph 2011 Slides
Here is a quick list of interesting Siggraph 2011 courses and talks available online.
Courses/Talks:
Advances in Real-Time Rendering in Games
Beyond Programmable Shading
State-of-the-Art: Interactive Global Illumination
Destruction and Dynamics for Film and Game Production
Filtering Approaches for Real-Time Anti-Aliasing
Production Volume Rendering
Compiler Technology for Rendering
Liquid Simulation with Mesh-Based Surface Tracking
Companies:
CRYTEK
NVIDIA: Slides, Videos
INTEL
OpenGL/OpenCL:
OpenGL BOF Slides
Introduction to Modern OpenGL Programming
OpenCL BOF Slides
Others:
HPG 2011
Also a great list with more posters and talks is available on Stephen Hill's Blog
There is also my talk on Interactive Indirect Illumination Using Voxel Cone Tracing :-)
OpenGL 4.2 specification released
Among interesting things: shader_atomic_counters, conservative depth, instanced transform feedback, the integration of shader_image_load_store, and more!
And congrats to Christophe for his first credit in the spec :-D
"Interactive Indirect Illumination Using Voxel Cone Tracing" paper (Pacific Graphics 2011)
http://www.icare3d.org/research-cat/publications/interactive-indirect-illumination-using-voxel-cone-tracing.html
On the NVIDIA Research website:
http://research.nvidia.com/publication/interactive-indirect-illumination-using-voxel-cone-tracing
Also, don't forget to attend my talk if you are going to Siggraph 2011!
Free 3D meshes links
I started to compile links to websites where free 3D models can be found. If you know other good websites, feel free to post them in the comments :-)
Static models and scenes:
- Great collection of models for scientific publications on Morgan McGuire's webpage: http://graphics.cs.williams.edu/data/meshes.xml
- 3D Render challenge: http://www.3drender.com/challenges/
- Crytek: http://www.crytek.com/cryengine/cryengine3/downloads
- Keenan Crane : http://www.cs.caltech.edu/~keenan/models.html
- Sibenik model: http://hdri.cgtechniques.com/~sibenik2/download/
- AIM@Shape : http://shapes.aimatshape.net/
- Characters Creative Common: http://artist-3d.com/
- Characters: http://www.3dvalley.com/3d-models/characters
- Blender files: http://www.blendswap.com/3D-models/category/featured/
- Archive 3D: http://archive3d.net/
Animated models and scenes:
- Ingo Wald: http://www.sci.utah.edu/~wald/animrep/
- MIT CSAIL 1: http://people.csail.mit.edu/drdaniel/mesh_animation/index.html
- MIT CSAIL 2: http://people.csail.mit.edu/drdaniel/dynamic_shape/index.html
- MIT Animals and Face: http://people.csail.mit.edu/sumner/research/deftransfer/data.html
- Face data: http://grail.cs.washington.edu/software-data/stfaces/index.html
- Pants: http://www.ryanmwhite.com/research/cloth_cap.html
Interactive Indirect Illumination and Ambient Occlusion Using Voxel Cone Tracing
The paper has been published at Pacific Graphics. You can find the author's version on my research page:
http://www.icare3d.org/research-cat/publications/interactive-indirect-illumination-using-voxel-cone-tracing.html
[Update 30/06/2011]: Benoit Rogez wrote a very good article about our approach on 3DVF (in French): here
[Update 10/06/2012]: More details on the algorithm and the fast voxelization and octree construction can be found in my GTC 2012 presentation.
Indirect lighting (global illumination):
High quality video: http://artis.imag.fr/Membres/Cyril.Crassin/GIVoxels/Siggraph11_GI1.mov





Ambient occlusion:
High quality video: http://artis.imag.fr/Membres/Cyril.Crassin/GIVoxels/Siggraph11_AO1.mov



The vicious circle of generalization @AltDevBlogADay
Full story here :-D
This can be described as the vicious circle of generalization:
1. We want the code to be as generalized as possible.
2. We design everything to be future-proof and extensible.
3. When a feature request arrives, we're doomed: we need to change a lot of code.
4. Why? Because everything was designed to be as generalized as possible.
5. goto 1;
GDC 2011 Technical Presentations
GDC 2011 is over now and presentations are starting to appear online :-)
DICE: http://publications.dice.se/
NVIDIA: http://www.nvidia.com/object/gdc2011.html
AMD: http://developer.amd.com/documentation/presentations/Pages/default.aspx#GDC
INTEL: http://software.intel.com/en-us/articles/intelgdc2011/
Khronos on OpenGL: http://www.khronos.org/library/detail/2011-gdc-opengl
More links can be found on this blog: http://msinilo.pl/blog/?p=724
Season of next-gen game engines
This is the season of new game engines: DICE's Frostbite 2, Crytek's CryEngine 3, and Epic's Unreal Engine 3!
CUDA 4.0 RC Released to Registered Developers
Here it is, CUDA 4.0 RC just got released to NVIDIA Registered developers.
Interesting stuff from the CUDA manual:
- Layered Textures Support (GL_TEXTURE_1D/2D_ARRAY): New tex.a1d/.a2d modifiers in PTX. But unfortunately the surface instructions do not support them yet, grrr.
Layered textures are created using cudaMalloc3DArray() with the cudaArrayLayered flag. There are new cudaTextureType1DLayered/cudaTextureType2DLayered texture sampler types and tex1DLayered()/tex2DLayered() access intrinsics (see the sketch after this list).
- New .address_size PTX specifier: Allows specifying the address size (32-bit/64-bit) used throughout a PTX module.
- Inline PTX assembly: This feature was already present since CUDA 2.x but was not officially supported. It's now fully supported and documented :-D
- Driver API, new thread-safe stateless launch API function cuLaunchKernel(): cuLaunchKernel(kernelObj, blocksPerGrid, 1, 1, threadsPerBlock, 1, 1, 0, 0, args, 0);
- Fermi ISA documented and supported by cuobjdump.
- Enhanced C++: Support for operators new and delete, virtual functions.
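As a quick illustration of the layered-texture API mentioned in the first item above, here is a minimal CUDA 4.0-style sketch (sizes and names are illustrative):

    texture<float, cudaTextureType2DLayered> layeredTex;

    __global__ void readLayer(float* out, int w, int layer) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        // Fetch from a given layer of the 2D layered texture.
        out[y * w + x] = tex2DLayered(layeredTex, x + 0.5f, y + 0.5f, layer);
    }

    void allocLayered(int w, int h, int layers) {
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
        cudaArray* arr;
        // The cudaArrayLayered flag turns the depth dimension into layers.
        cudaMalloc3DArray(&arr, &desc, make_cudaExtent(w, h, layers),
                          cudaArrayLayered);
        cudaBindTextureToArray(layeredTex, arr, desc);
    }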
CUDA 4.0 announced @GDC 2011
There is also support for direct peer-to-peer communication between GPUs, and for mapping the memory of multiple GPUs (and potentially other third-party devices like network/InfiniBand cards) into the same address space in order to provide direct memory access (Unified Virtual Addressing, UVA). Virtual functions should also now be supported, along with the new and delete operators for dynamic memory allocation from kernels.
Update: CUDA 4.0 RC released to registered developers
Slides are available there: http://bit.ly/cuda4features
Among the interesting novelties I did not see before, it seems inline PTX will be officially supported with this release! Also, the disassembler (cuobjdump), which was previously limited to the Tesla ISA, now supports Fermi ISA disassembly. Take a look at the manual for the list of supported instructions.
Writing a PhD thesis using FreeMind and LaTeX: My FreeMind LaTeX import plug-in

#AltDevBlogADay
Today, I would like to invite you to follow an awesome new collaborative game development blog called #AltDevBlogADay. The blog was founded in January by Mike Acton, engine director at Insomniac Games, and groups together many very good game developers, who publish one article each day on the website.
Here is a short list of the articles, among those already published, that I particularly liked and advise you to take a look at:
- Aliasing, the silent killer
- Forward vs. Deferred Rendering: why/when forward rendering still matters
- Think low level, write high level
- Preparing for Parallelism
- Intuition for Gamma Correct Rendering
- Link time sorting
- Research tastes better when served with source
- Managing Decoupling
- Radix Sort for Humans
- Non Virtual Interfaces
- Vectiquette
- The Virtual and No-Virtual
- Refactoring for Task Concurrency
- Performance metaprogramming
- The Rendering Equation - A Pictorial Introduction
- Managing Coupling Part 2 — Polling, Callbacks and Events
- Alternatives to malloc and new
- Data Structures: One size does not fit all
January pop stack post !
- NVIDIA Announced Project Denver: A fully integrated CPU+GPU chip dedicated to high-performance computing and based on ARM cores. It is designed to power future products ranging from personal computers to servers and supercomputers. It looks like NVIDIA's answer to AMD Fusion, and is also clearly a "we don't believe in the future of x86 architectures" statement to the industry. I really believe this is the right approach to bring high-performance CPUs to all market segments. Thanks to the rise of mobile and tablet-like devices, we may finally see the end of the hegemony of the x86 architecture. I really hope that it will not be the last dinosaur from NVIDIA ;-)
- gDebugger is now free !
- The “Principles of Digital Image Synthesis” book is now free for download, and a nicely formatted PDF can be found here.
- Data-Oriented Design - Links and Thoughts: The concept of Data-Oriented Design (DOD), as opposed to Object-Oriented Programming (OOP), has been emerging and gaining popularity in the game developer community lately. The idea is that instead of building performance-sensitive parts of applications around classes that represent individual objects, with inefficient (non-cache-coherent) memory access patterns, it is more efficient to build them by thinking in terms of data flows and memory access patterns. That's a concept that may not be unfamiliar to GPU developers :-D
- Overtime vs Productivity: Two very interesting blog posts by Allan McKay that discuss how people work, why we often get burned out, and how to avoid it.
- A lot of very interesting technical documentation about NVIDIA GPUs, coming from reverse engineering of various hardware, can be found in the pscnv Git repository. Pscnv is a fork of the nouveau project, an open source driver for NVIDIA GPUs.
"How the GPU works" @c0de517e
I rediscovered a very good in-depth explanation of how the GPU works, published in 2008 on the c0de517e blog:
Part 1 Part 2 Part 3
CUDA "Better Performance at Lower Occupancy" @GTC2010
A friend pointed me to this very interesting talk from NVIDIA GTC:
Better Performance at Lower Occupancy
It debunks two common fallacies that CUDA developers usually believe in:
- Multithreading is the only way to hide latency on GPU
- Shared memory is as fast as registers
All the GTC 2010 presentations can be found there (with slides and videos!):
http://www.nvidia.com/object/gtc2010-presentation-archive.html
Old Real-Time GPU Raytracer
PS: It is funny to see what was possible at the time, but it was developed quickly and the shader code itself is not a reference!
Fluid Simulation for Video Games @INTEL

CUDA 3.2 Final released
Download it on NVIDIA Developer website !
http://developer.nvidia.com/object/cuda_3_2_downloads.html
NVIDIA Fermi GPU and Architecture Analysis @Beyond3D

In this article, I also discovered "Pomegranate", a parallel hardware architecture for polygon rendering developed at Stanford, which seems to be very close to the way Fermi handles parallel work distribution for the different steps of the graphics pipeline. Pomegranate [Eldridge et al., 2000]
Discussions are on Beyond3D Forum.
First reviews of the NVIDIA GF110: GTX580

Texture and buffer access performance on Evergreen architecture @rastergrid.com
A very interesting article about texture and buffer access performance in OpenGL on the AMD Evergreen architecture:
http://rastergrid.com/blog/2010/11/texture-and-buffer-access-performance/
Various stuff from October
I did not have much time to update this blog lately, so here is some interesting stuff I did not post during October:
- GPU-Assisted Malware : http://www.ics.forth.gr/dcs/Activities/papers/gpumalware.malware10.pdf
- Thrust 1.3 released : http://gpgpu.org/2010/10/07/thrust-v1-3-release
- OpenGL 4.1 drivers status : g-truc creation
- "Can CPUs Match GPUs on Performance with Productivity ?" : IBM Research
- GPU Technology Conference Session Video Archive : NVIDIA
- EASTL: An implementation of the C++ STL made by EA and optimized for video game usage
OpenGL SuperBible Fifth Edition
Visual Transistor-level Simulation of the 6502 CPU
NVIDIA nextgen Kepler and Maxwell architectures codenames unveiled

So we now publicly know that Johannes Kepler and James Clerk Maxwell are the next two scientists to succeed Enrico Fermi!
OpenGL 4.2+ speculations @g-truc
An awesome post from Christophe Riccio about his speculations for the next OpenGL releases:
http://www.g-truc.net/post-tech-lastest-full.html#post0330
CUDA 3.2 RC
Available to registered developers.
Here is the interesting new stuff I found:
- Support for malloc() and free() in kernels: dynamic global memory allocation! This is implemented with a new syscall linking mechanism that seems to allow kernels to be linked to precompiled system calls. Info on the linking mechanism (.calltargets, .callprototype) can be found in section 10.3 of the PTX ISA manual. I hope this mechanism will get exposed for user functions in the API! (See the sketch after this list.)
- 64-bit addressing support in the CUDA driver API: Allows manipulating more than 4GB of device memory.
- New System Management Interface (nvidia-smi) for reporting various hardware counter information
- New stream synchronization function cudaStreamSynchronize(): allows GPU-side inter-stream synchronization.
- A set of new calls is available to allow the creation of CUDA devices with interoperability with Direct3D devices that use SLI in AFR (Alternate Frame Rendering)
- New flag to driver API texture reference (CU_TRSF_SRGB), which enables sRGB->linear conversion on a read.
- The reference manual adds architecture information on GF10x (GF104, GF106, GF108) class hardware (compute capability 2.1)
- Added a tld4 (fetch4) instruction for loading a single component (r, g, b, or a) from the four texels comprising the bilinear interpolation footprint of a given texture location.
- Added the ability to specify the state space and alignment of the memory pointed to by kernel pointer parameters.
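As a quick illustration of the device-side allocation mentioned in the first item above, here is a minimal CUDA sketch (kernel name and sizes are illustrative; requires sm_20+ hardware):

    __global__ void scratchKernel(int n) {
        // Dynamic global memory allocation from device code (CUDA 3.2+).
        int* buf = (int*)malloc(n * sizeof(int));
        if (buf == NULL) return;  // device heap exhausted
        for (int i = 0; i < n; ++i)
            buf[i] = i * (int)threadIdx.x;
        free(buf);
    }

    // Host side: the device heap size must be set before the first launch.
    cudaThreadSetLimit(cudaLimitMallocHeapSize, 32 * 1024 * 1024);
    scratchKernel<<<1, 64>>>(256);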
New CUDA Libraries
- CUSPARSE, supporting sparse matrix computations.
- CURAND, supporting random number generation for both host and device code with Sobol' quasi-random and XORWOW pseudo-random routines.
Official CUDA disassembler for sm_1.x real ISA
Yesterday NVIDIA released an official disassembler for the sm_1.x (pre-Fermi) real hardware ISA. It's like an official version of DECUDA :-) (which Wladimir stopped developing)
It takes either an ELF CUDA binary, a cubin, or even an exe file, and provides the low-level assembly code of the CUDA kernels.
It is only available to registered developers for now, but you can get a little more information on the CUDA forum.
That's something a lot of developers have been asking for for a while. It allows seeing the impact of optimizations on the real microcode, which is particularly important for register usage for instance (since register allocation is done after the PTX level).
It's nice that NVIDIA finally ended up unveiling its real hardware ISA instructions. AMD is still a little bit ahead on this, since the ISA instructions and microcode are documented even for the Evergreen architecture (RV870): http://developer.amd.com/gpu/ATIStreamSDK/assets/AMD_Evergreen-Family_ISA_Instructions_and_Microcode.pdf
What exactly is a doctorate @Gizmodo
What EXACTLY is a PhD? The answer is there:
http://gizmodo.com/5613794/what-is-exactly-a-doctorate
"Keep pushing" !
OpenGL 4.1 review @g-truc + Samples pack
OpenGL 4.1 Review: http://www.g-truc.net/post-tech-lastest-full.html
Christophe also published an updated version of his sample pack, with a nice table showing how the new stuff works on both NVIDIA and ATI (Beta drivers): http://www.g-truc.net/post-tech-lastest-full.html#post0321
UPDATE: Another good article about OpenGL 4.1: http://rastergrid.com/blog/2010/08/an-introduction-to-opengl-4-1/
Hacking D3D virtual calls + v-table hacking made useful @Humus
The Direct3D API is a fully object-oriented C++ API that relies on runtime polymorphism (virtual function calls) to be extendable and to easily provide different implementations. So all API calls are virtual calls, instead of plain C calls like in OpenGL.
Every slightly experienced C++ developer knows that virtual function calls introduce overhead and should be avoided inside inner loops. Humus shows how these virtual calls can be replaced by standard calls by hacking the API objects' v-table in order to keep a plain C pointer to these virtual methods!
http://www.humus.name/index.php?page=Comments&ID=321
I love this kind of hack! But as Humus explains, D3D (like OpenGL since OpenGL 3.0) does not rely on immediate mode anymore, which means that API calls usually consume a fairly large number of cycles compared to the overhead of a virtual call.
That means that in practice you won't get a significant performance gain from this hack, but it is just really cool :-D And the method could still be useful to overcome performance problems in more badly designed APIs! (A minimal illustration of the trick follows below.)
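To make the idea concrete, here is a small C++ sketch of the general v-table trick. This is not Humus's actual code: the Interface type is made up, and the whole thing relies on implementation-defined ABI details (MSVC-style layout assumed), so it is undefined behavior as far as the standard is concerned:

    // An interface in the style of a D3D device object.
    struct Interface {
        virtual int DrawIndexed(unsigned count) = 0;
    };

    // Matching raw function pointer type. On MSVC x86, member functions use
    // the __thiscall convention; adapt for other compilers/ABIs.
    typedef int (__thiscall* DrawIndexedFn)(Interface*, unsigned);

    int callDirect(Interface* obj, unsigned count) {
        // On common ABIs, the first machine word of the object is the
        // v-table pointer, and slot 0 holds the first declared virtual.
        void** vtable = *reinterpret_cast<void***>(obj);
        DrawIndexedFn fn = reinterpret_cast<DrawIndexedFn>(vtable[0]);
        return fn(obj, count);  // plain indirect call, no virtual dispatch
    }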
UPDATE : D3D v-table hacking... made useful !
Humus just published another trick that shows how hacking the v-table of the D3D context can be used to... replace the default API calls by your own enhanced calls !
Humus shows how this can be useful, for instance, to count the number of times an API function is called. This can be done by overwriting the original object's v-table pointer with the address of your own v-table. More details there:
http://www.humus.name/index.php?page=Comments&ID=322
Various GPU stuff from Siggraph time
NVIDIA Siggraph 2010 presentations available for streaming there.
- Among them a very interesting presentation of OpenGL 3.3/4.0/4.1 by Barthold Lichtenbelt and Mark Kilgard.
Siggraph 2010 Khronos OpenGL BOF and OpenCL BOF slides available.
Reference pages for OpenGL 3.3 and OpenGL 4.1 are online on opengl.org !
- I already said it, but I love the way OpenGL has been evolving since OpenGL 3.0! There really seems to be a will from the vendors to make it a first-class, innovative API again :-)
OptiX 2.0, and Cg Toolkit 3.0 released by NVIDIA
- SM 5 support in Cg at last !
- I tried OptiX (previously NVIRT) recently and I was really impressed, especially by the ease of use of the "high level" optixu interface. That's really an awesome tool.
OpenGL 4.1 Specifications released + NVIDIA drivers
It does not bring a lot of new features, but it's still great to see OpenGL evolving quickly! Direct State Access does not get into the core yet (sorry Christophe ;-), and I am not sure we will get it before OpenGL 5.0...
As usual, NVIDIA is very likely to announce the release of drivers supporting OpenGL 4.1 during the OpenGL BOF :-)
OpenGL.org forum official thread here.
Here are the main new features:
- Viewport Array (ARB_viewport_array). This is, for me, the most interesting new feature. It allows manipulating multiple viewports inside a given render call. Viewports control the behavior of the "viewport transformation" stage (view space -> window coordinates, scissor test). Multiple viewports can be created, and the geometry shader can direct emitted primitives to a selected viewport. A separate viewport rectangle and scissor region can be specified for each viewport (see the sketch after this list).
- Ability to get the binary representation of a program object (ARB_get_program_binary). This is a long-awaited feature present in DX for a while.
- Separate shader objects (ARB_separate_shader_objects). It allows compiling and linking a separate program for each shader stage (VS/TCS/TES/GS/FS). A Program Pipeline Object is introduced to manipulate and bind the separate programs. That's also a useful feature, and it was the way things were done in Cg.
- Improved compatibility with OpenGL ES 2.0 (ARB_ES2_compatibility). Adds a few missing functions and tokens.
- Support for 64-bit vertex attributes in GLSL (ARB_vertex_attrib_64bit).
- Increases required size for textures/renderbuffers.
- ARB_debug_output: Callback mechanisms to receive enhanced errors and warning messages.
- ARB_robustness: Addresses multiple specific goals to improve robustness, for example when running WebGL applications. For instance, it provides additional "safe" APIs that bound the amount of data returned by an API query.
- ARB_shader_stencil_export: Ability to set stencil values in a fragment shader for enhanced rendering flexibility :-)
- ARB_cl_event: Link OpenGL sync objects to OpenCL event objects for enhanced OpenCL interoperability.
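As a quick illustration of the viewport array feature flagged above, here is a minimal C++ sketch (names and layout are illustrative) that sets up four viewports in a 2x2 grid; a geometry shader would then route each primitive by writing gl_ViewportIndex:

    // Assumes a GL 4.1 context; w/h are the window dimensions.
    GLfloat vw = w * 0.5f, vh = h * 0.5f;
    for (GLuint i = 0; i < 4; ++i) {
        GLfloat x = (GLfloat)(i % 2) * vw;
        GLfloat y = (GLfloat)(i / 2) * vh;
        glViewportIndexedf(i, x, y, vw, vh);     // per-viewport rectangle
        glScissorIndexed(i, (GLint)x, (GLint)y,  // matching scissor region
                         (GLsizei)vw, (GLsizei)vh);
    }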
UPDATE 27/07: That's done, NVIDIA released its OpenGL 4.1 drivers! Everything there.
OpenGL 4.0+ ABuffer V2.0: Linked lists of fragment pages

Original basic approach
To try to solve this problem, I implemented a variant of the recent OIT method presented at GDC 2010 by AMD, which uses per-pixel linked lists. The main difference in my implementation is that fragments are not stored and linked individually, but in small pages of fragments (containing 4-6 fragments). Those pages are stored and allocated in a shared pool whose size is changed dynamically depending on the scene demands.
Using pages increases cache coherency when accessing the fragments, improves the efficiency of concurrent access to the shared pool, and decreases the storage cost of the links. This comes at the cost of a slight over-allocation of fragments.
The shared pool is composed of a fragment buffer, where fragment data is stored, and a link buffer storing the links between the pages, which are reverse-chained. Each pixel of the screen contains the index of the last page it references, as well as a counter with the total number of fragments stored for that pixel (incremented using atomics).
Access to the shared pool is managed through a global page counter, incremented using an atomic operation each time a page is needed by a fragment. The allocation of a page is done by a fragment when it detects that the current page is full, or that there is no page yet for the pixel. This is done inside a critical section, to ensure that multiple fragments in flight in the pipeline and falling into the same pixel are handled correctly (a sketch of this allocation logic follows below).
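The demo implements this in GLSL fragment shaders (EXT_shader_image_load_store); the following CUDA-flavored sketch of the per-fragment allocation logic is only meant to make the description above concrete. All names are hypothetical, and the per-pixel critical section is reduced to a comment:

    #define PAGE_SIZE 4  // fragments per page

    __device__ unsigned int globalPageCounter;  // next free page in the shared pool

    __device__ void storeFragment(int pixel, float4 frag,
                                  unsigned int* pixelPageHead,   // last page per pixel
                                  unsigned int* pixelFragCount,  // fragment count per pixel
                                  unsigned int* pageLinks,       // reverse links between pages
                                  float4* fragmentPool) {
        // NOTE: the real implementation wraps this in a per-pixel critical
        // section so that concurrent fragments falling into the same pixel
        // are handled correctly; that locking is omitted here for brevity.
        unsigned int count = atomicAdd(&pixelFragCount[pixel], 1u);
        unsigned int offsetInPage = count % PAGE_SIZE;
        if (offsetInPage == 0u) {
            // Current page full (or no page yet): grab a new page from the
            // shared pool and chain it to the previous head of the list.
            unsigned int newPage = atomicAdd(&globalPageCounter, 1u);
            pageLinks[newPage] = pixelPageHead[pixel];  // reverse chaining
            pixelPageHead[pixel] = newPage;
            fragmentPool[newPage * PAGE_SIZE] = frag;
        } else {
            unsigned int page = pixelPageHead[pixel];
            fragmentPool[page * PAGE_SIZE + offsetInPage] = frag;
        }
    }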
ABuffer memory occupancy differences:
Some memory occupancy examples of the fragment storage depending on screen resolution (Basic vs Linked Lists):
- 512x512: 64MB vs 6.5MB
- 768x708: 132.7MB vs 11.7MB
- 1680x988: 405MB vs 27.42MB
- 'x' : Switch between ABuffer Algorithms (V1.0 Basic and V2.0 Linked List)
- 'n' : Display the number of fragments per pixel.
- 'g' : Switch between Alpha-Blending and Gelly resolve modes.
UPDATE 28/10/2010: Oscarbg made a port of the demo intended to run on AMD (mainly removing everything related to shader_load/store); more info there:
http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=285258#Post285258
But sadly it still does not work on AMD, so if an AMD guy reads this, your help is welcome!
I can't try it myself since I don't have any AMD card :-(
CUDA 3.1 Final out
Check it there: http://developer.nvidia.com/object/cuda_3_1_downloads.html
I did not find any new features since the 3.1 beta, just bug fixes. For instance, the 2D surface write intrinsics that were buggy in the 3.1 beta are now working.
There is also a reference to GPUDirect in the "release highlights", but I did not find any documentation for it. GPUDirect seems to allow third-party devices to do direct memory access to the GPU memory (cf. the announcement here).
CUDA dynamic template parameters (2/2) : Emulation with metaprogramming
Following my last post, I will describe how template metaprogramming can be used to "emulate" dynamic kernel template parameters. This approach does not prevent the explosion of the amount of generated code at compile time, but it provides a simple interface to "transform" runtime variable values into kernel template parameters.
If you are not familiar with template metaprogramming, I suggest you read Andrei Alexandrescu's great book, Modern C++ Design: Generic Programming and Design Patterns Applied. This book comes with a small and simple metaprogramming library called Loki, which is just a series of helper headers. My code relies on a few structures from it.
Code description
Here is what a kernel call looks like with this mechanism:
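(The original listing is not reproduced here; the following is a minimal sketch reconstructed from the description below. The kernel name, functor arguments, and the use of an explicit std::vector instead of the vararg plumbing are all illustrative.)

    // Two runtime booleans are turned into the two compile-time template
    // parameters of myKernel<...> by the KernelCaller defined further below.
    std::vector<int> runtimeArgs;
    runtimeArgs.push_back(useFeatureA ? 1 : 0);
    runtimeArgs.push_back(useFeatureB ? 1 : 0);
    KernelCaller<2>::call(MyKernelCallFunctor(grid, block, d_data), runtimeArgs);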
The actual kernel call is encapsulated into a functor (here MyKernelCallFunctor) that is passed, with the runtime variables, to the call function of what I called the KernelCaller. The KernelCaller "transforms" the set of runtime variables passed to the call function into a set of compile-time parameters, passed to the functor in the form of a typelist.
The functor is in charge of making the actual kernel call by extracting the template values from the typelist, and by passing to the __global__ function the arguments previously passed through its constructor and stored into the structure. It has to be written for each kernel you want to launch:
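(Again a minimal reconstruction, assuming Loki's typelists and Int2Type; myKernel and its float* argument are illustrative.)

    #include <vector>
    #include <loki/Typelist.h>
    #include <loki/TypeManip.h>

    template <bool P0, bool P1>
    __global__ void myKernel(float* data) {
        // Illustrative body: the booleans select code paths at compile time.
        if (P0) data[threadIdx.x] += 1.0f;
        if (P1) data[threadIdx.x] *= 2.0f;
    }

    struct MyKernelCallFunctor {
        dim3 grid, block;
        float* data;  // arguments forwarded to the __global__ function

        MyKernelCallFunctor(dim3 g, dim3 b, float* d)
            : grid(g), block(b), data(d) {}

        // Invoked by the KernelCaller once the typelist of compile-time
        // values has been built; extracts them and launches the kernel.
        template <class TList>
        void operator()(TList) const {
            myKernel< (bool)Loki::TL::TypeAt<TList, 0>::Result::value,
                      (bool)Loki::TL::TypeAt<TList, 1>::Result::value >
                    <<<grid, block>>>(data);
        }
    };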
The KernelCaller is the core of the mechanism. It operates by recursively adding static values into the typelist by testing the value of the runtime variables. In reality, the compiler will generate code for all the branches, and so will instantiate all possible sets of compile-time parameters as typelists, the right one being selected at runtime. To ease usage, the runtime variables are passed using a variable argument list and transformed into a vector containing the values.
The compile-time recursion is stopped thanks to the number of arguments, passed as a template parameter to the KernelCaller:
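(Sketch of the boolean-only KernelCaller described above: it peels one runtime value per recursion level, appends Loki::Int2Type<0> or Int2Type<1> to the typelist, and fires the functor when no arguments are left.)

    template <int NumArgs, class TList = Loki::NullType>
    struct KernelCaller {
        template <class Functor>
        static void call(const Functor& functor, const std::vector<int>& v) {
            // Next runtime value to convert into a compile-time constant.
            if (v[v.size() - NumArgs])
                KernelCaller<NumArgs - 1, typename Loki::TL::Append<
                    TList, Loki::Int2Type<1> >::Result>::call(functor, v);
            else
                KernelCaller<NumArgs - 1, typename Loki::TL::Append<
                    TList, Loki::Int2Type<0> >::Result>::call(functor, v);
        }
    };

    // Recursion terminator: all values converted, invoke the functor with
    // the completed typelist.
    template <class TList>
    struct KernelCaller<0, TList> {
        template <class Functor>
        static void call(const Functor& functor, const std::vector<int>&) {
            functor(TList());
        }
    };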
Here it is: as is, we already have a working mechanism able to deal with an arbitrary number of boolean parameters!
But this is not much fun to write, and not very scalable. So once again, we can rely on templates to generate the code for us!
In this case here is what the KernelCaller::call function looks like:
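(Sketch of the generalized KernelCaller, which replaces the boolean-only version above: it now carries a MaxArgValue bound and delegates the runtime-value test to a recursive ValueSelector, defined next.)

    // Forward declaration; the definition follows below.
    template <int NumArgs, int MaxArgValue, class TList, int Cand>
    struct ValueSelector;

    template <int NumArgs, int MaxArgValue, class TList = Loki::NullType>
    struct KernelCaller {
        template <class Functor>
        static void call(const Functor& functor, const std::vector<int>& v) {
            // Scan candidate values MaxArgValue..0 for the next runtime value.
            ValueSelector<NumArgs, MaxArgValue, TList, MaxArgValue>::select(
                functor, v, v[v.size() - NumArgs]);
        }
    };

    // Terminator: every runtime value has been converted.
    template <int MaxArgValue, class TList>
    struct KernelCaller<0, MaxArgValue, TList> {
        template <class Functor>
        static void call(const Functor& functor, const std::vector<int>&) {
            functor(TList());
        }
    };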
And here is the ValueSelector mechanism:
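(Sketch of the ValueSelector: it compares the runtime value against each candidate from MaxArgValue down to 0; on a match it appends the corresponding Int2Type to the typelist and recurses into the KernelCaller for the remaining arguments. Note that MaxArgValue is carried along, as per the bug fix mentioned in the update below.)

    template <int NumArgs, int MaxArgValue, class TList, int Cand>
    struct ValueSelector {
        template <class Functor>
        static void select(const Functor& functor,
                           const std::vector<int>& v, int value) {
            if (value == Cand)
                KernelCaller<NumArgs - 1, MaxArgValue, typename Loki::TL::Append<
                    TList, Loki::Int2Type<Cand> >::Result>::call(functor, v);
            else  // keep scanning smaller candidate values
                ValueSelector<NumArgs, MaxArgValue, TList, Cand - 1>::select(
                    functor, v, value);
        }
    };

    // Candidate scan terminator: only value 0 remains possible.
    template <int NumArgs, int MaxArgValue, class TList>
    struct ValueSelector<NumArgs, MaxArgValue, TList, 0> {
        template <class Functor>
        static void select(const Functor& functor,
                           const std::vector<int>& v, int) {
            KernelCaller<NumArgs - 1, MaxArgValue, typename Loki::TL::Append<
                TList, Loki::Int2Type<0> >::Result>::call(functor, v);
        }
    };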
Have fun, and... sorry for this crazy long post !
[Update June 21]
Code updated: There were two bugs in the code:
- Using a reference as the first parameter of a vararg is forbidden by C++ and resulted in a crash on 32-bit machines -> transformed into a pointer.
- The MaxArgValue argument to the KernelCaller was not kept by the ValueSelector, resulting in the use of the default value of 10 regardless of what the user specified, so it was very slow to compile even for boolean values.
I also added a #pragma message in order to follow the recursive compilation.
Dependency on cutil_inline.h removed.
CUDA dynamic template parameters (1/2) : My dream CUDA feature
One common problem when using templates to parametrize CUDA kernels (cf. my previous post) is being able to dynamically select the set of template parameters used for a call, depending on runtime variables. This usually leads to an exploding number of cascaded ifs/switches, and a lot of code copy/paste, to instantiate the whole parameter tree for each kernel call.
This situation is illustrated by the following code for boolean parameters:
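(A minimal sketch of what such cascaded branches look like for two boolean parameters; kernel and variable names are illustrative.)

    // Every combination has to be spelled out by hand, and every branch
    // instantiates yet another version of the same kernel.
    if (useFeatureA) {
        if (useFeatureB) myKernel<true,  true ><<<grid, block>>>(d_data);
        else             myKernel<true,  false><<<grid, block>>>(d_data);
    } else {
        if (useFeatureB) myKernel<false, true ><<<grid, block>>>(d_data);
        else             myKernel<false, false><<<grid, block>>>(d_data);
    }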
In addition to the pain it is to write, such code results in the compilation of an exploding number of versions of the same kernel, one for each instantiated template configuration.
This feature would be implemented by taking advantage of C-level JIT (Just-In-Time) kernel compilation (the current CUDA JIT compiler operates at the PTX level). It implies recompiling the kernel at runtime with a new set of template parameters each time a value changes. It requires tracking the last value of each parameter so that recompilation happens only when necessary. To be a bit more efficient, the generated code could also be cached in some way so that it can be reused.
This would change the kernel compilation paradigm to something closer to the OpenCL compiling model, but while keeping the nice CUDA-C syntax provided by nvcc.
That feature would be very useful, and it would be great if NVIDIA made CUDA evolve in that direction, or if someone wrote a JIT CUDA-C compiler that allows it!
More details coming in the next post !
Fast and Accurate Single-Pass A-Buffer using OpenGL 4.0+
One of the first things I wanted to try on the GF100 was the new NVIDIA extensions that allow random-access reads/writes and atomic operations on global memory and textures, in order to implement a fast A-Buffer!
It worked pretty well, since it provides something like a 1.5x speedup over the fastest previous approach (at least that I know about!), with zero artifacts, and it supports an arbitrary number of layers with a single geometry pass.
Sample application sources and Win32 executable:
Sources+executable+Stanford Dragon model
Additional models
Be aware that this will probably only run on a Fermi card (Forceware drivers >=257.15). In particular it requires: EXT_shader_image_load_store, NV_shader_buffer_load, NV_shader_buffer_store, EXT_direct_state_access.
The application uses freeglut in order to initialize an OpenGL 4.0 context with the core profile.
Keys:
- 'a' Enable/Disable A-Buffer
- 's' Enable/Disable fragment sorting. Disabled = closest fragment kept during resolve.
- 'g' Switch between Alpha-Blending and Gelly resolve modes.
- 'c' Enable/Disable alpha correction when in Alpha-Blending mode.
- 't' Switch between using textures or global memory for A-Buffer storage.
- '1'-'3' Change mesh (requires the additional models).
A-Buffer:
Basically, an A-Buffer is a simple list of fragments per pixel [Carpenter 1984]. Previous methods to implement it on DX10-generation hardware required multiple passes to capture an interesting number of fragments per pixel. They were essentially based on depth peeling, with enhancements allowing the capture of more than one layer per geometry pass, like the k-buffer or the stencil-routed k-buffer. Bucket-sort depth peeling allows capturing up to 32 fragments per geometry pass, but with only 32 bits per fragment (just a depth) and at the cost of potential collisions. All these techniques were complex, and especially limited by the maximum of 8 render targets writable by the fragment shader.
This technique can handle an arbitrary number of fragments per pixel in a single pass, the only limitation being the available video memory. In this example, I do order-independent transparency with fragments storing 4x32-bit values containing the RGB color components and the depth.
Technique:
The idea is very simple: each fragment is written by the fragment shader at its position into a pre-allocated 2D texture array (or a global memory region) with a fixed maximum number of layers. The layer to write the fragment into is given by a counter, stored per pixel in another 2D texture and incremented using an atomic increment (or addition) operation ([image]AtomicIncWrap or [image]AtomicAdd). After the rendering pass, the A-Buffer contains an unordered list of fragments per pixel, together with its size (a sketch of the capture step follows below). To sort these fragments by depth and compose them on the screen, I simply use a single screen-filling quad with a fragment shader. This shader copies all the pixel's fragments into a local array (probably stored in L1 on Fermi), sorts them with a naive bubble sort, and then combines them front-to-back based on transparency.
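(The demo does this in GLSL with EXT_shader_image_load_store; the following CUDA-flavored sketch of the capture step is only meant to make the description concrete, and all names are hypothetical.)

    // abuffer is laid out as [pixel * maxLayers + layer]; counters holds one
    // fragment count per pixel.
    __device__ void captureFragment(int pixel, float4 frag,
                                    unsigned int* counters,
                                    float4* abuffer, int maxLayers) {
        // Atomically grab the next free layer for this pixel.
        unsigned int layer = atomicAdd(&counters[pixel], 1u);
        if (layer < (unsigned int)maxLayers)
            abuffer[pixel * maxLayers + layer] = frag;
        // The resolve pass then copies the per-pixel list into a local
        // array, bubble-sorts it by depth, and blends front-to-back.
    }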
Performances:
To compare performance, this sample also features a standard rasterization mode which renders directly into the color buffer. On the Stanford Dragon example, with a GTX 480 and 32 layers in the A-Buffer, the technique runs at 400-500 FPS and is only 5-20% more costly than a simple rasterization of the mesh.
I also compared performance with the k-buffer, whose code is available online (still, be careful: it may not be super optimized). On the GTX 480, with the same model and shading (and 16 layers), I get more than a 2x speedup. Based on these results, I strongly believe that it is also close to 1.5x faster than bucket-sort depth peeling, without its depth collision problems.
EDIT: The artifacts in the stencil-routed k-buffer came from a bug in DXUT; images removed. Also added a warning about the performance of the k-buffer OpenGL code from Louis Bavoil's page.
EDIT 2: The follow-up of this work using per-pixel linked-lists can also be read there: http://blog.icare3d.org/2010/07/opengl-40-abuffer-v20-linked-lists-of.html