CUDA 3.1 Final out

Check it there:
I did not find any new features since the 3.1 Beta, just bug fixes: for instance, the 2D surface write intrinsics that were buggy in the 3.1 Beta are now working.
There is also a reference to GPUDirect in the "release highlights", but I did not find any documentation for it. GPUDirect seems to allow third-party devices to do direct memory access to GPU memory (cf. the announcement here).

CUDA dynamic template parameters (2/2): Emulation with metaprogramming

Following up on my last post, I will describe how template metaprogramming can be used to "emulate" dynamic kernel template parameters. This approach does not prevent the explosion in the amount of code generated at compile time, but it provides a simple interface to "transform" runtime variable values into kernel template parameters.

If you are not familiar with template metaprogramming, I suggest you read Andrei Alexandrescu's great book, Modern C++ Design: Generic Programming and Design Patterns Applied. This book comes with a small and simple metaprogramming library called Loki, which is just a series of helper headers. My code relies on a few structures from it.

Code description
Here is what a kernel call looks like with this mechanism:
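The original listing is a sketch along these lines (names follow the post's sample code; the grid/block dimensions, data pointer, and variable names are assumptions). Two runtime booleans become two compile-time template parameters:

```cuda
int useShadows = 1, useFog = 0;       // runtime values known only at run time
KernelCaller<2>::call(
    MyKernelCallFunctor(grid, block, data),  // wraps the actual kernel launch
    &useShadows, useFog);                    // vararg list of runtime values
```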

The actual kernel call is encapsulated in a functor (here MyKernelCallFunctor) that is passed, together with the runtime variables, to the call function of what I called the KernelCaller. The KernelCaller "transforms" the set of runtime variables passed to the call function into a set of compile-time parameters, passed to the functor in the form of a typelist.
The functor is in charge of making the actual kernel call, by extracting the template values from the typelist and passing to the __global__ function the arguments previously given to its constructor and stored in the structure. It has to be written for each kernel you want to launch:
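A minimal sketch of such a functor (member names and the kernel signature are assumptions, not the post's exact code):

```cuda
template <int Param0, int Param1>
__global__ void myKernel(float* data);  // the kernel being parametrized

// Stores the launch arguments, then extracts the compile-time values from
// the typelist to instantiate and launch the kernel:
struct MyKernelCallFunctor {
    MyKernelCallFunctor(dim3 grid, dim3 block, float* data)
        : grid(grid), block(block), data(data) {}

    template <class TList>
    void operator()(TList) {
        myKernel<TList::Head::value, TList::Tail::Head::value>
            <<<grid, block>>>(data);
    }

    dim3 grid, block;
    float* data;
};
```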

The KernelCaller is the core of the mechanism. It operates by recursively adding static values into the typelist based on tests of the runtime variables. In reality, the compiler generates code for all the branches, and so instantiates all possible sets of compile-time parameters as typelists, the right one being selected at runtime. To ease usage, the runtime variables are passed using a variable argument list and transformed into a vector containing the values.

The compile-time recursion is stopped thanks to the number of arguments, passed as a template parameter to the KernelCaller:
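Here is a self-contained sketch of the boolean version (the post's sample uses the real Loki headers and a vararg interface; this sketch uses a plain std::vector and a recording functor instead of a kernel launch, so it can run host-side):

```cpp
#include <cassert>
#include <vector>

// Minimal Loki-style helpers (the real sample includes them from Loki):
struct NullType {};
template <class T, class U> struct Typelist { typedef T Head; typedef U Tail; };
template <int v> struct Int2Type { enum { value = v }; };

// Observable side effect standing in for a kernel launch.
static int g_config = -1;

// Stand-in functor: decodes the typelist back into plain integers. A real
// functor would launch a __global__ kernel templated on these values.
struct RecordFunctor {
    template <class TList>
    void operator()(TList) {
        g_config = TList::Head::value * 2 + TList::Tail::Head::value;
    }
};

// Recursively turns runtime booleans into compile-time Int2Type entries of a
// typelist. Both branches are compiled, so all 2^NumParams configurations
// are instantiated; the runtime tests select the right one.
template <int NumParams, class TList = NullType>
struct KernelCaller {
    template <class Functor>
    static void call(Functor fun, const std::vector<int>& args) {
        if (args[NumParams - 1])
            KernelCaller<NumParams - 1,
                         Typelist<Int2Type<1>, TList> >::call(fun, args);
        else
            KernelCaller<NumParams - 1,
                         Typelist<Int2Type<0>, TList> >::call(fun, args);
    }
};

// All parameters consumed: hand the completed typelist to the functor.
template <class TList>
struct KernelCaller<0, TList> {
    template <class Functor>
    static void call(Functor fun, const std::vector<int>&) { fun(TList()); }
};
```

Usage: with `args = {1, 0}`, `KernelCaller<2>::call(RecordFunctor(), args)` invokes the functor with `Typelist<Int2Type<1>, Typelist<Int2Type<0>, NullType> >`, i.e. the configuration built from the runtime values.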

Here it is: as is, we already have a working mechanism able to deal with an arbitrary number of boolean parameters!

Extending the approach to integer parameters
The approach as presented here works well for boolean template parameters. If we want to deal with more than the two values 0/1 per variable, we need to write something like this:
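For instance, the hand-written dispatch for two parameters with three possible values each already needs nine instantiations (illustrative names), and the count grows exponentially with the number of parameters:

```cuda
switch (p0) {
case 0: switch (p1) {
        case 0: myKernel<0, 0><<<grid, block>>>(data); break;
        case 1: myKernel<0, 1><<<grid, block>>>(data); break;
        case 2: myKernel<0, 2><<<grid, block>>>(data); break; } break;
case 1: switch (p1) {
        case 0: myKernel<1, 0><<<grid, block>>>(data); break;
        case 1: myKernel<1, 1><<<grid, block>>>(data); break;
        case 2: myKernel<1, 2><<<grid, block>>>(data); break; } break;
case 2: switch (p1) {
        case 0: myKernel<2, 0><<<grid, block>>>(data); break;
        case 1: myKernel<2, 1><<<grid, block>>>(data); break;
        case 2: myKernel<2, 2><<<grid, block>>>(data); break; } break;
}
```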

But this is not much fun to write, and it does not scale. So once again, we can rely on templates to generate the code for us!
In this case, here is what the KernelCaller::call function looks like:
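A sketch of its assumed shape, following the post's description (and the June 21 update: the first vararg parameter is a pointer, since a reference is not allowed before an ellipsis; the default MaxArgValue of 10 is the one mentioned in the update):

```cuda
template <int NumParams, int MaxArgValue = 10>
struct KernelCaller {
    template <class Functor>
    static void call(Functor fun, int* firstArg, ...) {
        // Collect the runtime values into a vector:
        std::vector<int> args;
        args.push_back(*firstArg);
        va_list ap;
        va_start(ap, firstArg);
        for (int i = 1; i < NumParams; ++i)
            args.push_back(va_arg(ap, int));
        va_end(ap);
        // Start the compile-time recursion over parameter values:
        ValueSelector<NumParams, MaxArgValue, NullType>::select(fun, args);
    }
};
```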

And here is the ValueSelector mechanism:
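The following is a self-contained sketch of such a ValueSelector (exact member names are assumptions; a recording functor stands in for the kernel launch so it can run host-side). For each parameter, it linearly tries every candidate value at compile time until one matches the runtime value, then recurses on the remaining parameters:

```cpp
#include <cassert>
#include <vector>

// Minimal Loki-style helpers:
struct NullType {};
template <class T, class U> struct Typelist { typedef T Head; typedef U Tail; };
template <int v> struct Int2Type { enum { value = v }; };

static int g_result = -1;

// Stand-in functor: decodes the two selected values p0, p1. A real functor
// would launch a kernel templated on them.
struct RecordFunctor {
    template <class TList>
    void operator()(TList) {
        g_result = TList::Head::value * 10 + TList::Tail::Head::value;
    }
};

// Tries every candidate CurValue in [0, MaxValue) for parameter number
// (NumParams - 1); on a match, pushes it into the typelist and recurses.
template <int NumParams, int MaxValue, class TList, int CurValue = 0>
struct ValueSelector {
    template <class Functor>
    static void select(Functor fun, const std::vector<int>& args) {
        if (args[NumParams - 1] == CurValue)
            ValueSelector<NumParams - 1, MaxValue,
                          Typelist<Int2Type<CurValue>, TList>, 0>::select(fun, args);
        else
            ValueSelector<NumParams, MaxValue, TList, CurValue + 1>::select(fun, args);
    }
};

// Candidate values exhausted: the runtime value was out of range, do nothing.
template <int NumParams, int MaxValue, class TList>
struct ValueSelector<NumParams, MaxValue, TList, MaxValue> {
    template <class Functor>
    static void select(Functor, const std::vector<int>&) {}
};

// All parameters consumed: invoke the functor with the completed typelist.
template <int MaxValue, class TList, int CurValue>
struct ValueSelector<0, MaxValue, TList, CurValue> {
    template <class Functor>
    static void select(Functor fun, const std::vector<int>&) { fun(TList()); }
};
```

In the full sample, KernelCaller::call delegates to this after collecting the vararg values, and MaxValue (MaxArgValue) bounds the compile-time recursion, which is why forgetting to propagate it made compilation so slow (cf. the June 21 update).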

Loki structures
Here are the Loki structures we rely on:
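In minimal form (the real Loki headers provide the same definitions plus many helpers):

```cpp
#include <cassert>

// Terminator marking the end of a typelist.
struct NullType {};

// A compile-time list of types: a Head, then recursively the Tail.
template <class T, class U>
struct Typelist { typedef T Head; typedef U Tail; };

// Wraps an integral constant into a distinct type, so that a value can be
// carried around as a template argument inside a typelist.
template <int v>
struct Int2Type { enum { value = v }; };
```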

Full source code
A full running sample code can be downloaded here.

Have fun, and... sorry for this crazy long post!

[Update June 21]
Code updated: there were two bugs in the code:
  • Using a reference as the first named parameter of a vararg function is forbidden by C++ and resulted in a crash on 32-bit machines; it is now a pointer.
  • The MaxArgValue argument to the KernelCaller was not propagated to the ValueSelector, so the default value of 10 was used whatever the user specified, making compilation very slow even for boolean values.
I added two #defines to control the sample kernel call: SAMPLE_NUMBER_OF_PARAMS indicating the number of parameters to use, and SAMPLE_NUMBER_OF_INTEGER_VALUES indicating the number of values to use for each parameter (2 for boolean parameters).
I also added a #pragma message in order to follow the recursive compilation. 
Dependency on cutil_inline.h removed.

CUDA dynamic template parameters (1/2): My dream CUDA feature

One common problem when using templates to parametrize CUDA kernels (cf. my previous post) is dynamically selecting the set of template parameters to use for a call, depending on runtime variables. This usually leads to an exploding number of cascaded if/switch statements and a lot of copy/pasted code to instantiate the whole parameter tree for each kernel call.
This situation is illustrated by the following code for boolean parameters:
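The stripped listing was a cascade of this shape (variable and kernel names are illustrative):

```cuda
if (useShadows) {
    if (useFog) myKernel<true, true ><<<grid, block>>>(data);
    else        myKernel<true, false><<<grid, block>>>(data);
} else {
    if (useFog) myKernel<false, true ><<<grid, block>>>(data);
    else        myKernel<false, false><<<grid, block>>>(data);
}
```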

In addition to being painful to write, such code results in the compilation of an exploding number of versions of the same kernel, one for each instantiated template configuration.

Dynamic template parameters with JIT Kernels compilation
There is a CUDA feature I have been dreaming about for some time now that would solve both problems: dynamic template parameters. What I mean by this is the ability for a CUDA kernel to accept true C variables (containing runtime values) as integer template parameters. The syntax would simply look like this:
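Something like this (hypothetical syntax, not valid CUDA today; names are made up):

```cuda
template <int NumLayers, bool Sorted>
__global__ void myKernel(float* data);

void render(float* data, int numLayers, bool sorted, dim3 grid, dim3 block) {
    // numLayers and sorted are plain runtime variables, used directly as
    // template arguments; the runtime would JIT-recompile the kernel
    // whenever their values change:
    myKernel<numLayers, sorted><<<grid, block>>>(data);
}
```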

This feature would be implemented by taking advantage of a C-level JIT (Just-In-Time) kernel compilation (the current CUDA JIT compiler operates at the PTX level). It implies recompiling the kernel at runtime with a new set of template parameters each time a value changes, and it requires tracking the last value of each parameter so that recompilation happens only when necessary. To be a bit more efficient, the generated code could also be cached in some way so that it can be reused.
This would change the kernel compilation paradigm to something closer to the OpenCL compiling model, but while keeping the nice CUDA-C syntax provided by nvcc.
Such a feature would be very useful, and it would be great if NVIDIA made CUDA evolve in that direction, or if someone wrote a JIT CUDA-C compiler that allowed it!

Emulating dynamic templates... with templates !
While waiting for that feature, dynamic integer template parameters can be partially emulated today... with template metaprogramming! The idea is to instantiate the whole parameter tree at compile time using templates, and to select the right one at runtime, based on the variables.

More details coming in the next post !

Fast and Accurate Single-Pass A-Buffer using OpenGL 4.0+

One of the first things I wanted to try on the GF100 was the new NVIDIA extensions that allow random-access read/write and atomic operations on global memory and textures, to implement a fast A-Buffer!

It worked pretty well: it provides something like a 1.5x speedup over the fastest previous approach (at least that I know about!), with zero artifacts, and it supports an arbitrary number of layers with a single geometry pass.

Sample application sources and Win32 executable:
Sources+executable+Stanford Dragon model
Additional models

Be aware that this will probably only run on a Fermi card (Forceware drivers >= 257.15). In particular, it requires: EXT_shader_image_load_store, NV_shader_buffer_load, NV_shader_buffer_store, EXT_direct_state_access.
The application uses freeglut to initialize an OpenGL 4.0 context with the core profile.

  • 'a' Enable/Disable the A-Buffer
  • 's' Enable/Disable fragment sorting. When disabled, the closest fragment is kept during resolve.
  • 'g' Switch between Alpha-Blending and Gelly resolve modes.
  • 'c' Enable/Disable alpha correction when in Alpha-Blending mode.
  • 't' Switch between using textures or global memory for A-Buffer storage.
  • '1'-'3' Change mesh (requires the additional models).
UPDATE: Files updated with a bug fix.

Basically, an A-Buffer is a simple list of fragments per pixel [Carpenter 1984]. Previous methods to implement it on DX10-generation hardware required multiple passes to capture an interesting number of fragments per pixel. They were essentially based on depth peeling, with enhancements allowing more than one layer to be captured per geometry pass, like the k-buffer or the stencil-routed k-buffer. Bucket-sort depth peeling allows capturing up to 32 fragments per geometry pass, but with only 32 bits per fragment (just a depth) and at the cost of potential collisions. All these techniques were complex, and especially limited by the maximum of 8 render targets writable by the fragment shader.

This technique can handle an arbitrary number of fragments per pixel in a single pass, the only limitation being the available video memory. In this example, I implement order-independent transparency, with fragments storing 4x32-bit values containing the RGB color components and the depth.

The idea is very simple: each fragment is written by the fragment shader at its position into a pre-allocated 2D texture array (or a global memory region) with a fixed maximum number of layers. The layer to write the fragment into is given by a counter, stored per pixel in another 2D texture and incremented using an atomic increment (or addition) operation (imageAtomicIncWrap or imageAtomicAdd). After the rendering pass, the A-Buffer contains an unordered list of fragments per pixel, together with its size. To sort these fragments by depth and compose them on screen, I simply use a single screen-filling quad with a fragment shader. This shader copies all the pixel's fragments into a local array (hopefully stored in L1 on Fermi), sorts them with a naive bubble sort, and then combines them front-to-back based on transparency.
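The capture pass described above can be sketched as a fragment shader (a minimal sketch with assumed names and layouts; the actual sample also supports a global memory path and more fragment data):

```glsl
#version 400
#extension GL_EXT_shader_image_load_store : enable

layout(size1x32) coherent uniform uimage2D abufferCounter; // per-pixel count
layout(size4x32) coherent uniform image2DArray abuffer;    // fragment storage

in vec4 fragColorDepth;  // RGB color + depth packed by the vertex stage

void main() {
    ivec2 coord = ivec2(gl_FragCoord.xy);
    // Atomically reserve the next free layer in this pixel's list:
    uint layer = imageAtomicAdd(abufferCounter, coord, 1u);
    // Store the fragment there; ordering happens later, in the resolve pass:
    imageStore(abuffer, ivec3(coord, int(layer)), fragColorDepth);
    discard;  // nothing goes to the color buffer during the capture pass
}
```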

To compare performance, this sample also features a standard rasterization mode which renders directly into the color buffer. On the Stanford Dragon example, with a GTX 480 and 32 layers in the A-Buffer, the technique runs at 400-500 FPS, and is only 5-20% more costly than a simple rasterization of the mesh.
I also compared performance with the k-buffer, whose code is available online (be careful though: it may not be fully optimized). On the GTX 480, with the same model and shading (and 16 layers), I get more than a 2x speedup. Based on these results, I strongly believe it is also close to 1.5x faster than bucket-sort depth peeling, without its depth-collision problems.

EDIT: The artifacts in the stencil-routed k-buffer came from a bug in DXUT; images removed. I also added a warning about the performance of the k-buffer OpenGL code from Louis Bavoil's page.

EDIT 2: The follow-up to this work, using per-pixel linked lists, can be read here:

My new toy

On Friday I received a GeForce GTX 480, sent by NVIDIA to the lab!
It will nicely replace my dusty GTX 280 :-D

Fermi output queues and L2 write combining experiments

A researcher from Los Alamos compared the performance (between Tesla 2 and Fermi) of output queues implemented with an atomic add on an integer index per queue. First result: a 16x speedup on Fermi!

This is supposedly thanks to the coalescing of atomic operations, which may be done in the L2 cache.
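The queue technique itself is simple to sketch in CUDA (an illustrative kernel, not the Los Alamos code):

```cuda
// Each thread that produces an element reserves a slot in the output queue
// with an atomic add on the queue's index, then writes into that slot.
__global__ void produce(int* queue, unsigned int* queueCount, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n && (tid & 1)) {                    // arbitrary predicate
        unsigned int slot = atomicAdd(queueCount, 1u);
        queue[slot] = tid;                         // slot is uniquely owned
    }
}
```

On Fermi, nearby threads hitting the same counter can have their atomics combined in L2, which is where the measured speedup comes from.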

He also did another experiment to see whether the L2 cache allows combining writes from different blocks to global memory, and it appears to be the case when consecutive blocks write to the same cache line at the same time. Result: a 3.25x speedup on Fermi.

Fermi-based GPU Supercomputer #2 in Top500!

I always knew that day would come!

The June 2010 Top500 list of the world's fastest supercomputers was released this week at ISC 2010. While the US Jaguar supercomputer (located at the Department of Energy's Oak Ridge Leadership Computing Facility) retained the top spot in Linpack performance, a Chinese cluster called Nebulae, built from a Dawning TC3600 Blade system with Intel X5650 processors and NVIDIA Tesla C2050 GPUs, is now the fastest in theoretical peak performance at 2.98 PFlop/s, and No. 2 with a Linpack performance of 1.271 PFlop/s. This is the highest rank a GPU-accelerated system, or a Chinese system, has ever achieved on the Top500 list.
For more information, visit

Copyright © Icare3D Blog