"How the GPU works" @c0de517e

I rediscovered a very good in-depth explanation on how GPU works published in 2008 on c0de517e blog:
Part 1 Part 2 Part 3

CUDA "Better Performance at Lower Occupancy" @GTC2010

A friend point me this very interesting talk at NVIDIA GTC:
Better Performance at Lower Occupancy

They deny two common fallacies that CUDA developer usually believe in:

  • Multithreading is the only way to hide latency on GPU
  • Shared memory is as fast as registers

All the GTC2010 presentations can be found there (with slides and videos !):

Old Real-Time GPU Raytracer

I just translated from French to English an old page on my website about a real-time GPU raytracer I developed for fun 4 years ago, during my Master Thesis. It is old school GPGPU in OpenGL and Cg that can run on an NV40 (GeForce 6800). No need for CUDA or a GF110 to do GPU raytracing ! ;-)
The application also features a slow and unoptimized CPU raytracer.

See there: http://www.icare3d.org/myprojects/opengl_projects/raytracer_gpu_full_1.0.html

PS: It is funny to see what was possible at this time, but it was developed quickly and the shader code itself is not a reference !

Fluid Simulation for Video Games @INTEL

There is a very interesting series of article about fluid simulations for video games written by Michael J. Gourlay on intel developer website. Source code is also provided.
Parts: 1, 2, 3, 4, 5, 6, 7, 8

CUDA 3.2 Final released

Download it on NVIDIA Developer website !

NVIDIA Fermi GPU and Architecture Analysis @Beyond3D

The article is 3 weeks old but I just read it. Beyond3D published a very good analysis of the Fermi architecture. It is based on many homemade tests they developed to bench individual parts of the GF100 chip. Based on these analysis, they made interesting discoveries and speculations on the GF100 architecture.

In this article, I also discovered "Pomegranate", a parallel hardware architecture for polygon rendering developed at Stanford and that seems to be very close to the way Fermi handle parallel work distribution of the different steps of the graphics pipeline. Pomegranate [Eldrige et al, 2000]

Discussions are on Beyond3D Forum.

Here are some interesting statements:

Read more »

First reviews of the NVIDIA GF110: GTX580

The GF110 is the new high-end GPU from NVIDIA based on a renewed Fermi architecture. Even if the chip has not been officially launched, reviews starts already to appear online !

In french:

It seems reviews are pretty good !
To sum-up, full speed FP16 texture filtering, Z-cull performances improved, architectural tweaks, 15-20% performance improvements over GTX480 in games, less power consumption, quieter, cooler.
According to techreport, an interesting subtle change is that the 16/48KB local storage partition can be configured by the driver for graphics contexts, while it was only configurable in compute on the GF100.

Congrats NVIDIA :-)

Texture and buffer access performance on Evergreen architecture @rastergrid.com

A very interesting article about textures and buffer access performances in OpenGL on AMD Evergreen architecture:

Various stuff from October

I did not have much time to update this blog lately, so here are some interesting stuff I did not post during October:

OpenGL SuperBible Fifth Edition

Last week, Addison Wesley kindly sent me a copy of the Fifth Edition of the OpenGL SuperBible so that I can write a review of it. So let's do that :-)

The OpenGL SuperBible has been a reference book since the first release and this fifth edition is the first edition to be exclusively focused on modern, shader based, OpenGL programming. That's the great novelty of this edition: it is based on the OpenGL 3.3 API and all discussions about deprecated fixed-function programming has been thrown out of the book.

Read more »

Visual Transistor-level Simulation of the 6502 CPU

If you have ever dreamed to see a chip operating at a transistor level, take a look at this crazy project presented at Siggraph this year: a transistor level simulator of the 6502 CPU (that was powering the Apple 2) !
Greg James, Barry Silverman, Brian Silverman who are leading this project built the simulator by reverse engineering the chip from high resolution die shots they used to reconstruct the full polygon model of the chip circuits !

They provide the simulator as a javascript applet you can use to program the virtual chip and see the circuits operating : http://visual6502.org/JSSim/index.html 

The Siggraph talk can be found there: http://visual6502.org/docs/6502_in_action_14_web.pdf

Realistic 3D projection on a building

Awesome !

NVIDIA nextgen Kepler and Maxwell architectures codenames unveiled

During the opening keynote of the GPU Technology Conference, Jen-Hsun Huang unvailled the codenames and the roadmap for the next 2 generation of NVIDIA GPU architectures !

So we now publicly know that Johannes Kepler and James Clerk Maxwell are the two next scientists that will succeed to Enrico Fermi !

Sources : 

Read more »

OpenGL 4.2+ speculations @g-truc

An awesome post from Christophe Riccio about his speculation for next OpenGL releases:

Here is the summary:

Read more »


Available to registered developers.

Here are the interesting new stuff I found:

  • Support for malloc() and free() in kernels: dynamic global memory allocation !
  • This is implemented with a new syscall linking mechanism that seems to allow kernel to be linked to precompiled system calls. Infos on the linking mechanism (.calltargets , .callprototype ) can be found in section 10.3 of the PTX ISA manual. I hope this mechanism will get exposed for user functions in the API !
    • 64 bits addressing support in CUDA driver AP: Allows manipulating more than 4GB of device memory.
    • New System Management Interface (nvidia-smi) for reporting various hardware counters informations
    • New stream synchronization function cudaStreamSynchronize(): allow GPU-side inter-streams synchronisation.
    • A set of new calls is available to allow the creation of CUDA devices with interoperability with Direct3D devices that use SLI in AFR (Alternate Frame Rendering) 
    • New flag to driver API texture reference (CU_TRSF_SRGB), which enables sRGB->linear conversion on a read.
    • Reference manual adds architecture information on GF10x (GF104, GF106, GF108) class hardware (compute capability 2.1)
    Changes in PTX ISA 2.2:
    • Add tld4 (fetch4) instruction for loading a component (r, g, b, or a) from the four texels compising the bilinear interpolation footprint of a given texture location.
    • Add kernel pointer parameter state space and alignment of the memory being pointed to.

    New CUDA Libraries
    • CUSPARSE, supporting sparse matrix computations.
    • CURAND, supporting random number generation for both host and device code with Sobel quasi-random and XORWOW pseudo random routines.

    Official CUDA disasembler for sm_1.x real ISA

    Yesturday NVIDIA released an official disasembler for sm_1.x (pre-Fermi) real hardware ISA. It's like an official version of DECUDA :-) (that Wladimir stopped to develop)
    It takes either an ELF CUDA binary, a cubin or even an exe file, and provides the low level assembly code of the CUDA kernels.
    It is only available for registered developer for now, but you can get a little more information the CUDA forum.

    That's something a lot of developers have been asking for for a while. That allows to see the impact of optimizations on the real microcode, and it is particularly important for register usage for instance (since registers allocations is done after the PTX level).
    Nice NVIDIA finally end up unveiling it's real hardware ISA instructions. AMD is still a little bit ahead on this since the ISA instructions and microcode is available even for the Evergreen architecture (RV870): http://developer.amd.com/gpu/ATIStreamSDK/assets/AMD_Evergreen-Family_ISA_Instructions_and_Microcode.pdf

    What exactly is a doctorate @Gizmodo

    What is EXACTLY a PhD  ? The answer is there :

    "Keep pushing" !

    OpenGL 4.1 review @g-truc + Samples pack

    OpenGL 4.1 Review: http://www.g-truc.net/post-tech-lastest-full.html

    Christophe also published an updated version of his sample pack, with a nice table showing how the new stuff works on both NVIDIA and ATI (Beta drivers): http://www.g-truc.net/post-tech-lastest-full.html#post0321

    UPDATE: Another good article about OpenGL 4.1: http://rastergrid.com/blog/2010/08/an-introduction-to-opengl-4-1/

    Hacking D3D virtual calls + v-table hacking made useful @Humus

    Direct3D API is a fully C++ object-oriented API and rely on runtime polymorphisms (virtual fonction calls) to be expendable and easily being able to provide different implementations. So all API calls are virtual calls instead of being plain C calls like in OpenGL.
    Every slightly experimented C++ developer knows that virtual functions calls introduce overhead and that they should be avoided inside inner loops. Humus shows how these virtual calls can be replaced by standard calls by hacking the API objects v-table in order to keep a plain C pointer on these virtual methods !

    I love this kind of hack ! But as Humus explains, D3D (like OpenGL since OpenGL 3.0) do not rely on immediate mode anymore, that means that API calls are usually consuming a slightly large amount of cycle compared to the overhead of a virtual call.
    That means that in practice you wont get significant performance gain from this hack, but this is just really cool :-D And this method could still be useful to overcome performance problems in more badly design APIs !

    UPDATE : D3D v-table hacking... made useful !
    Humus just published another trick that shows how hacking the v-table of the D3D context can be used to... replace the default API calls by your own enhanced calls !
    Humus shows how this can be useful to count the number of time an API function is called for instance. This can be done by overwriting the original object v-table pointer with the address of your own v-table. More details there :

    Various GPU stuff from Siggraph time

    NVIDIA Siggraph 2010 presentations available in streaming there.

    Siggraph 2010 Khronos OpenGL BOF and OpenCL BOF slides available.

    Reference pages for OpenGL 3.3 and OpenGL 4.1 are online on opengl.org !
    • I already said it, but I love the way OpenGL has been evolving since OpenGL 3.0 ! It really seems to be a real willpower from the vendors to make it again a first class innovative API :-)

    OptiX 2.0, and Cg Toolkit 3.0 released by NVIDIA
    • SM 5 support in Cg at last ! 
    • I tried OptiX (previoulsy NVIRT) recently and I was really impressed, especially by the easiness of usage of the "high level" optixu interface. That's really an awesome tool.

    OpenGL 4.1 Specifications released + NVIDIA drivers

    The specifications of OpenGL 4.1 just got released by the Khronos group (But why didn't they wait for the OpenGL BOF ??).

    It does not bring a lot of new features, but it's still great to see OpenGL quickly evolving ! Direct State Access does not get into the core yet (sorry Christophe ;-), and I am not sure we will get it before OpenGL 5.0...

    As usual, NVIDIA is very likely to announce the release of drivers supporting OpenGL 4.1 during the OpenGL BOF :-)

    OpenGL.org forum official thread here.

    Here are the main new features:
    • Viewport Array (ARB_viewport_array). This is, for me, the most interesting new feature. It allows to manipulate multiple viewports inside a given render call. Viewports control the behavior of the "viewport transformantion" stage (view space -> window coordinates, scissor test). Multiple viewports can be created and the geometry shader can direct emitted primitives to a selected viewport. A separate viewport rectangle and scissor region can be specified for each viewport.
    • Ability to get the binary representation of a program object (ARB_get_program_binary). This is a long-awaited feature present in DX for a while.
    • Separate shader objects (ARB_separate_shader_objects). It allows to compile and to to link a separate program for each shader stage (PS/GS/TCS/TES/FS). A Program Pipeline Object is introduced to manipulate and bind the separate programs. That's also a useful features, and that was the way to do in Cg.
    • Improved compatibility with OpenGL ES 2.0 (ARB_ES2_compatibility). Adds a few missing functions and tokens.
    • Support for 64bits vertex attributes in GLSL (ARB_vertex_attrib_64bit).
    • Increases required size for textures/renderbuffers.

      Some interesting new extensions were also released:
      • ARB_debug_output: Callback mechanisms to receive enhanced errors and warning messages.
      • ARB_robustness: Address multiple specific goals to improve robustness, for example when running WebGL applications. For instance it provide additional "safe" APIs that bound the amount of data returned by an API query.
      • ARB_shader_stencil_export: Ability to set stencil values in a fragment shader for enhanced rendering flexibility :-)
      • ARB_cl_event: Link OpenGL sync objects to OpenCL event objects for enhanced OpenCL interoperability. 

      UPDATE 27/07: That's done, NVIDIA released it's OpenGL 4.1 drivers ! Everything there.

          OpenGL 4.0+ ABuffer V2.0: Linked lists of fragment pages

          The main problem with my first ABuffer implementation (cf. my previous post) was that a fixed maximum number of fragments per pixel has to be allocated at initialization time. With this approach, the size of the ABuffer can quickly become very large when the screen resolution and depth complexity of the scene increase.

          Using linked lists of fragment pages per pixel

          Original basic approach

          To try to solve this problem, I implemented a variant of the recent OIT method presented at the GDC2010 by AMD and using per-pixel linked lists. The main difference in my implementation is that fragments are not stored and linked individually but into small pages of fragments (containing 4-6 fragments). Those pages are stored and allocated in a shared pool whose size is changed dynamically depending on the scene demands.
          Using pages allows to increase the cache coherency when accessing the fragments, improve the efficiency of concurrent access to the shared pool and decrease the storage cost of the links. This is at the cost of a slight over-allocation of fragments.
          The shared pool is composed of a fragment buffer where fragment data is stored, and a link buffer storing links between the pages that are reverse chained. Each pixel of the screen contains the index of the last page it references, as well as a counter with the total number of fragments stored in that pixel (incremented using atomics).
          The access to the shared pool is manage through a global page counter, incremented using an atomic operation each time a page is needed by a fragment. The allocation of a page is done by a fragment when it detects that the current page is full, or there is not any page yet for the pixel. This is done inside a critical section to unsure that multiple fragments together in the pipeline and falling into the same pixel will be handled correctly.

          ABuffer memory occupancy differences:

          Some memory occupancy examples of the fragments storage depending on screen resolution (Basic vs Linked Lists):
          • 512x512:    64MB vs 6.5MB 
          • 768x708:   132.7MB vs 11.7MB
          • 1680x988:  405MB vs 27.42MB

          The cost of this huge reduction of the storage need is that the rendering speed decreases compared to the basic approach. Linked lists can be down to half the speed of the basic approach when per-fragment additional costs are low, due to the additional memory access and the increased complexity of the fragment shader (more code, more registers). But this cost seems well amortized when the shading costs per-fragment increase.

          Order Independent Transparency (OIT) demo application & source code
          New keys:
          • 'x' : Switch between ABuffer Algorithms (V1.0 Basic and V2.0 Linked List)
          • 'n' : Display the number of fragments per pixel.
          • 'g' : Swith between Alpha-Blending and Gelly resolve modes.

          UPDATE 28/10/2010: Oscarbg did a port of the demo so that it can run on AMD (mainly removing everything related to shader_load/store), more info there:
          But sadly still does not work on AMD, so if an AMD guy read that, your help is welcome !
          I can't try myself since I don't have any AMD card :-(

            CUDA 3.1 Final out

            Check it there: http://developer.nvidia.com/object/cuda_3_1_downloads.html
            Did not find new features since 3.1beta, just bug fixes. 2D surface write intrinsics that were buggy in 3.1 Beta are now working for instance.
            There is also a reference to GPUDirect in the "release highlights", but I did not find any documentation for it. GPUDirect seams to allow third party devices to do direct memory acces to the GPU memory (cf the anouncement here).

            CUDA dynamic template parameters (2/2) : Emulation with metaprogramming

            Following my last post, I will describe how template metaprogramming can be used to "emulate" dynamic kernel template parameters. This approach does not prevent explosion of the amount of generated code at compile time, but provides a simple interface to "transform" runtime variable values into kernel template parameters.

            If you are not familiar with templates metaprogramming, I suggest you to read Andrei Alexandrescu's great book: Modern C++ Design: Generic Programming and Design Patterns Applied. This book comes with a small and simple metaprogramming library called Loki that is just a series of helper headers. My code relies on a few structures from it.

            Code description
            Here is what a kernel call looks like with this mechanism:

            The actual kernel call is encapsulated into a functor (here MyKernelCallFunctor) that is passed, with the runtime variables, to the call function of what I called the KernelCaller. The KernelCaller "transform" the set of runtime variables passed to the call function into a set of compile time parameters passed to the functor in the form of a typelist.
            The functor is in charge making the actual kernel call by extracting the template values from the typelist, and by passing to the __global__ function the arguments previously passed through it's constructor and stored into the structure. It has to be written for each kernel you want to launch:

            The KernelCaller is the core of the mechanism. It operates by recursively adding static values into the typelist by testing the value of the runtime variables. In reallity the compiler will generate code for all the branches and so will instanciate all possible sets of compile time parameters as typelists, the right one being selected at runtime. To ease usage, the runtime variables are passed using a variable argument list and transformed into a vector containing the values.

            The compile time recursion is stopped thanks to the number of arguments passed as a template parameter to the KernelCaller.  :

            Here it is, as is we already have a working mechanism able to deal with arbitrary number of boolean parameters !

            Extending the approach to integer parameters
            The approach as presented here works well for boolean template parameters. If we want to deal with more values than 0/1 per variable, we need to write something like this:

            But this is not very funny to write and not very scalable. So once again we can rely on templates to generate the code for us !
            In this case here is what the KernelCaller::call function looks like:

            And here is the ValueSelector mechanism:

            Loki structures
            Here are the Loki structures we rely on:

            Full source code
            A full running sample code can be downloaded here.

            Have fun, and... sorry for this crazy long post !

            [Update June 21]
            Code updated: There was 2 bugs in the code :
            • Using a reference as first parameter of a vararg is forbiden by C++ and resulted in a crash on 32bits machines -> transformed into a pointer.
            • The MaxArgValue argument to the KernelCaller was not kept by the ValueSelector, resulting in the use of the default value of 10 whatever the user specified so it was very slow to compile even for boolean values.
            I added two #defines to control the sample kernel call: SAMPLE_NUMBER_OF_PARAMS indicating the number of parameters to use and SAMPLE_NUMBER_OF_INTEGER_VALUES indicating the number of values to use for each paramater (2 for boolean parameters).
            I also added a #pragma message in order to follow the recursive compilation. 
            Dependency on cutil_inline.h removed.

            CUDA dynamic template parameters (1/2) : My dream CUDA feature

            One common problem when using templates to parametrize CUDA kernels (cf. my previous post) is to be able to dynamically select the set of template parameters to use for a call, depending on runtime variables. This usually leads to an exploding number of cascaded if/switch and a lot of code copy/paste to instantiate the whole parameters tree for each kernel call.
            This situation is illustrated by the following code for boolean parameters:

            In addition to the pain it is to write, such code results in the compilation of an exploding number of versions of the same kernel, one for each instantiated template configuration.

            Dynamic template parameters with JIT Kernels compilation
            There is a CUDA feature I am dreaming about for a few time now and that would solve both problems: dynamic template parameters. What I mean by this is the ability for a CUDA kernel to accept true C variables (containing runtime values) as integer template parameters. The syntax would simply look like this:

            This feature would be implemented by taking advantage of a C-level JIT (Just In Time) kernel compilation (current CUDA JIT compiler operates at the PTX level). It implies recompiling the kernel at runtime with a new set of template parameters each time a value changed. It requires tracking the last value of each parameters so that recompilation happens only when necessary. To be a bit more efficient, generated code could also be cached in some way so that it can be reused.
            This would change the kernel compilation paradigm to something closer to the OpenCL compiling model, but while keeping the nice CUDA-C syntax provided by nvcc.
            That feature would be very useful, and it would be great if NVIDIA makes CUDA evolves in that direction, or if someone write a JIT CUDA-C compiler that allows that !

            Emulating dynamic templates... with templates !
            While waiting for that feature, dynamic integer template parameters can be partially emulated today... with template metaprogramming ! The idea is to instantiate the whole parameters tree at compile time using templates, and to select the right one at runtime, based on the variables.

            More details coming in the next post !

            Fast and Accurate Single-Pass A-Buffer using OpenGL 4.0+

            One of the first thing I wanted do try on the GF100 was the new NVIDIA extensions that allows random access read/write and atomic operations into global memory and textures, to implement a fast A-Buffer !

            It worked pretty well since it provides something like a 1.5x speedup over the fastest previous approach (at least I know about !), with zero artifact and supporting arbitrary number of layers with a single geometry pass. 

            Sample application sources and Win32 executable:
            Sources+executable+Stanford Dragon model
            Additional models

            Be aware that this will probably only run on a Fermi card (Forceware drivers >=257.15). In particular it requires: EXT_shader_image_load_store, NV_shader_buffer_load, NV_shader_buffer_store, EXT_direct_state_access.
            Application uses freeglut in order to initialize an OpenGL 4.0 context with the core profile.

            • 'a' Enable/Disable A-Buffer
            • 's' Enable/Disable fragments sorting. Disable= closest fragment kept during resolve.
            • 'g' Swith between Alpha-Blending and Gelly resolve modes.
            • 'c' Enable/Disable alpha correction when in Alpha-Blending mode.
            • 't' Swith between using textures or global memory for A-Buffer storage.
            • '1'-'3' Change mesh (requires the additional models).
            UPDATE: Files updated with a bug fix.

            Basically an A-buffer is a simple list of fragments per pixel [Carpenter 1984]. Previous methods to implement it on DX10 generation hardware required multiple passes to capture an interesting number of fragments per pixel. They where essentially based on depth-peeling, with enhancements allowing to capture more than one layer per geometric pass, like the k-buffer, stencil routed k-buffer. Bucket sort depth peeling allows to capture up to 32 fragments per geometry pass but with only 32 bits per fragment (just a depth) and at the cost of potential collisions. All these techniques were complex and especially limited by the maximum of 8 render targets that were writable by the fragment shader.

            This technique can handle arbitrary number of fragments per pixels in a single pass, with only limitation the available video memory. In this example, I do order independent transparency with fragments storing 4x32bits values containing RGB color components and the depth.

            The idea is very simple: Each fragment is written by the fragment shader at it's position into a pre-allocated 2D texture array (or a global memory region) with a fixed maximum number of layers. The layer to write the fragment into is given by a counter stored per pixel into another 2D texture and incremented using an atomic increment (or addition) operation ( [image]AtomicIncWrap or [image]AtomicAdd). After the rendering pass, the A-Buffer contains an unordered list of fragments per pixel with it's size. To sort these fragments per depth and compose them on the screen, I simply use a single screen filling quad with a fragment shader. This shader copy all the pixel fragments in a local array (probably stored in L1 on Fermi), sort them with a naive bubble sort, and then combine them front-to-back based on transparency.

            To compare performances, this sample also features a standard rasterization mode which renders directly into the color buffer. On the Stanford Dragon example, a GTX480 and 32 layers in the A-Buffer, the technique range between 400-500 FPS, and is only 5-20% more costly than a simple rasterization of the mesh.
            I also compared performances with the k-buffer which code is available online (still be careful, it may not be super optimized). On the GTX480, with the same model and shading (and 16 layers), I can get more than a 2x speedup. Based on that results, I strongly believe that it is also close to 1.5x faster than the bucket sort depth peeling, without it's depth collision problems.

            EDIT: Artifacts in stencil-routed k-buffer came from a bug in DXUT, images removed. Also added a warning about the performances of the k-buffer OpenGL code from Louis Bavoil page.

            EDIT 2: The follow-up of this work using per-pixel linked-lists can also be read there: http://blog.icare3d.org/2010/07/opengl-40-abuffer-v20-linked-lists-of.html

            My new toy

            On Friday I received a GeForce GTX 480 sent by NVIDIA to the lab !
            It will nicely replace my dusty GTX 280 :-D

            Fermi output queues and L2 write combining experiments

            A guy from Los Alamos compared the performances (between Tesla 2 and Fermi) of output queues using atomic-add on an integer index per queue. First result : 16x speedup on Fermi !

            Its is supposedly thanks to the coalescing of atomic operation that may be done in the L2 cache.

            He also did another experiment to see if the L2 cache allows combining writes from different blocks into global memory, and it appears to be the case when you have consecutive blocks writing to the same cache line at the same time. Result: 3.25x speedup on Fermi.

            Fermi based GPU Supercomputer #2 in Top500 !

            I always known that day would come !

            The June 2010 Top500 list of the world’s fastest supercomputers was released this week at ISC 2010.  While the US Jaguar supercomputer (located at the Department of Energy’s Oak Ridge Leadership Computing Facility) retained the top spot in Linpack performance, a Chinese cluster called Nebulae, built from a Dawning TC3600 Blade system with Intel X5650 processors and NVIDIA Tesla C2050 GPUs is now the fastest in theoretical peak performance at 2.98 PFlop/s and No. 2 with a Linpack performance of 1.271 PFlop/s. This is the highest rank a GPU-accelerated system, or a Chinese system, has ever achieved on the Top500 list.
            For more information, visit www.TOP500.org.
            Source: gpgpu.org

            Why Intel Larrabee Really Stumbled @BSN

            After Intel officially admitted (through this blog post, also an interesting article here) that Larrabee is not going to play in the consumer gaming market in the "near future", BsN publish an interesting post-mortem article from Andrew Richards : Why Intel Larrabee Really Stumbled: Developer Analysis

            The curse of the Stanford Bunny

            At a given point of your PhD, even trees start looking like the Stanford Bunny !

            And it's not Photoshopped (I saw it during a hike near Grenoble).

            OpenGL 4.0 review @G-Truc

            Just discovered a good review of OpenGL 4.0 made by Christophe Riccio "Groovounet", I missed it in March.


            CUDA 3.1Beta out

            NVIDIA released a beta version of the CUDA 3.1 toolkit for register developers.

            New features from the programming guide :

            • 16bits float textures supported by the runtime API. __float2half_rn() and __half2float() intrinsic added (Table C-3).
            • Surface memory interface exposed in the runtime API (Section 3.2.5, B9). Read/Write access into textures (CUDA Arrays). But limited to 1D and 2D Arrays yet.
            • Up to 16 parallel kernel launches on Fermi (it was only 4 in CUDA 3.0). Not sure how it is really implemented (one per SM ? multiple per SM ?).
            • Recursive calls supported in device function on Fermi (B.1.4). Stack size query and setting functions added (cudaThreadGetLimit(), cudaThreadSetLimit()).
            • Function pointers supported on device functions on Fermi (B.1.4). Function pointers to global functions supported on all GPUs.
            • Just noticed that a __CUDA_ARCH__ macro allowing to write different code paths depending on the architecture (or code executed on the host) is here since CUDA 3.0 (B.1.4).
            • printf support into kernels integrated into the API for sm_20 (B.14). Note that a cuprintf supporting all architectures was provided to register developers a few months ago.
            • New __byte_perm(x,y,s) intrinsic (C.2.3).
            • New __forceinline__ function qualifier to force inlining on Fermi. A __noinline__ was also present already to allow forcing function call on sm_1.x
            • New –dlcm compilation flag to specify global memory caching strategy on Fermi (G.4.2).

            Interesting new stuff in the Fermi Compatibility Guide:
            • Just-in-time kernel compilation can be used with the runtime API with R195 drivers (Section 1.2.1).
            • Details using the volatile keyword for intra-warp communications (Section 1.2.2).

            Interesting new stuff in the Best Practice Guide:
            • Uses signed integer instead of unsigned as loop counter. It allows the compiler to perform strength reduction and can provides better performances (Section 6.3).

            Jeremy Sugerman's PhD defense on GRAMPS

            Jeremy Sugerman defended his PhD yesterday, the slides from his talk on GRAMPS can be found there:

            Programming Many-Core Systems with GRAMPS

            Here is also a talk he did on GRAMPS at UCDavis in February :

            Why you should use OpenGL and not DirectX @Wolfire

            A quite old stuff (from January) I just read again. I like this point of view :-)


            Bit hacks and low level algorithms

            Here are 3 good places to find bit manipulation hacks and efficient low level algorithms for various mathematical functions:

            Bit Twiddling Hacks (Stanford)
            The Aggregate Magic Algorithms (University of Kentucky)
            HAKMEM (MIT)

            If you have other ones like this, do not hesitate to post them in the comments !

            Weta Digital Experience

            I just come back from 3 months in New-Zealand, working with Weta Digital. It was great, a very nice and interesting experience !

            NVIDIA GT200 microbenchmarking

            A crazy paper from university of Toronto:
            Demystifying GPU Microarchitecture through Microbenchmarking

            This  work  develops  a  microbechmark  suite  and  measures  the CUDA-visible architectural characteristics of the Nvidia GT200 (GTX280) GPU. Various undisclosed characteristics of the processing elements and the memory hierarchies are measured.

            CUDA Template Metaprogramming

            CUDA is awesome and, for me, one of the reason I think it is better than OpenCL is because of its support of C++ templates.

            I have been using templates in CUDA for quite a long time now, and in addition to the classical "generic programming" advantages (generic types, functors...), using templates allows for a lot of optimizations in kernel functions.

            First, templated values ( template<uint i>... ) can be used as compile time constants. For instance, blockDim is very often known and fixed at compile time. Passing it through a templated value instead of relying on the built-in variable allows faster access, since its value is directly integrated as a constant in the asm. The compiler can optimize some operations, if the constant is a power of two for instance, multiplications and divisions will be transformed into bit-shifts.

            Even more interestingly, you can help the compiler in many cases where it would not optimize itself, by implementing the optimizations yourself using template evaluation. Such usage of templates is called template metaprogramming. C++ templates are turing-complete, that means you can implement any computation you want so that it will be evaluated at compile time by the template processor.
            For instance, I am not sure the compiler will detect when you are passing a constant to a function like log2(). But you can implement the compile time (recursive) evaluation of log2 very easily with templates:

            Template Metaprogramming libraries exist and provide a lot of very advanced and powerful features. I am personally using Loki that is the library written by Andrei Alexandrescu as part of his (awesome) book Modern C++ Design. I am mainly using Typelist and Type manipulation features and they compile perfectly with CUDA 2.3.

            Nature's renderer is awesome

            Some impressive pictures of the Eyjafjallajokull volcano currently freezing European air traffic:

            Even if it is so beautiful, it would be nice that this pretty little volcano stops doing it's teenage angst so that I can come back to my home place next week !

            CUDA: Beware of the structs...

            ... and unions in local variables, they eat kittens !

            PS: And in many situations, they also fall down in local memory. So if you are writing a ray tracer, do not use a Ray structure !

            Three big lies (of Software Development)

            Insomniac Games Head to Engine Director Mike Acton  @GDC 2010

            (Lie #1) Software is a platform

            "The reality is software is not a platform. You can't idealize the hardware. And the constants in the "Big-O notation" that are so often ignored, are often the parts that actually matter in reality(...) You can't judge code in a vacuum. Hardware impacts data design."

            (Lie #2) Code should be designed around a model of the world
            "There is no value in code being some kind of model or map of an imaginary world (...) it is extremely popular. If there's a rocket in the game, rest assured that there is a "Rocket" class (...) which contains data for exactly one rocket and does rockety stuff (...) There are a lot of performance penalties for this kind of design, the most significant one is that it doesn't scale. At all. One hundred rockets costs one hundred times as much as one rocket. And it's extremely likely it costs even more than that !"

            (Lie #3) Code is more important than data
            "Code is ephimiral and has no real intrinsic value. The algorithms certainly do, sure. But the code itself isn't worth all this time (...). The code, the performance and the features hinge on one thing - the data."


            CUDA "volatile trick"

            A very useful trick found on the CUDA forum.

            Very often, the CUDA compiler inline the operations needed to compute the value of a variable used at several places, instead of keeping the variable in a register. This can be a good strategy in some situations, but there is also many cases where it brings register usage up unnecessarily and duplicates instructions. To prevent this, the "volatile" keyword can be used when the variable is declared, forcing it to be really kept and reused.
            This trick also work with constant variables (and shared memory) which would otherwise get loaded into registers over and over when accessed at several places.

            It clearly reduces the number of virtual registers allocated at the PTX level, which helps a lot for the real register allocation phase that happens later during the transform to cubin. However, be careful not using it with constantly indexed arrays for instance, they would be put in local memory.

            More info there:


            After the Mandelbulb (can also see my implementation here), here is the Mandelbox found on fractalforums.com !

            Larrabee New Instructions @drdobbs

            A First Look at the Larrabee New Instructions (LRBni), by Michael Abrash


            Not that I am passionated by larrabee -and I don't really believe in an x86 GPU-, but it is still interesting to see what choices has been made for the ISA. After the announced cancellation of the first Larrabee chip as a GPU, I have heard rumors saying that it could still be proposed to HPCs (and it seems that the Larrabee GPU could be rescheduled in a few years).

            CUDA PTX 2.0 (Fermi) specification released

            NVIDIA made available the specification of the PTX 2.0 ISA for Fermi, this can be downloaded there:

            Among interesting things I saw :
            • New Texture, Sampler and Surface types: Opaque type for manipulating texture, sampler and surface descriptor as normal variables. -> More flexible texture manipulation, allow arrays of textures for instance.
            • New syntax for abstracting an underlining ABI (Application Binary Interface): define a syntax for function definition/calls, parameter passing, variadic functions, and dynamic memory allocation in the stack ("alloca"). -> true function calls, and recursivity ! But not yet implemented in CUDA 3.0.
            • New binary instructions  popc (population count, number of one bits), clz (count leading zeros), bfind (non significant non-sign bit), brev (bit reverse), bfe/bfi (bit field extract/insert, ?), prmt (permute)
            • Cache operators ( Allow to select (per operation) the level of caching in the cache hierarchy (L1/L2) of the load/store instructions.
            • Prefetch instructions (Table 84) that allows forcing the load of a page in global/local memory into a specific cache level).
            • Surface load/store (surd/sust, Tables 90/91): Read/Write (through ROPs ?) into render targets. (Support 3D R/W! Hum.. really working ?)
            • Video instructions: Vector operations on bytes/half-words/words.
            • Performance tuning directives (10.3): Allows to help the compiler to optimize the code based on bloc configurations.

            ARM CPUs analysis @bsn

            Interesting analysis of ARM last architectures and comparison with x86:


            AMD/ATI Stories @anandtech

            Anandtech published interesting insights into the strategy of ATI/AMD for it's two last successful architectures, threw interviews with Carrell Killebrew, engineering lead on RV770.

            The RV770 Story: Documenting ATI's Road to Success
            The RV870 Story: AMD Showing up to the Fight

            NVIDIA OpenGL 4.0 driver + extensions



            • Fermi/DX11 level shaders extensions (NV_gpu_program5, NV_tesssellation_program5, NV_gpu_shader5)
            • Global memory Load/Store from shaders ! (NV_shader_buffer_load, NV_shader_buffer_store, build upon "Bindless Graphics")
            • Texture Load/Store ! (EXT_shader_image_load_store)
            OpenGL is moving forward, yeah \o/

            Jan Vlietinck's DirectCompute fluid simulation


            It simulates an incompressible fluid (Navier- Stokes differential equations) through the well known velocity advection (Jos Stam method). The simulation of the video runs in a 200x200x200 voxel grid.

            Nice work Jan :-)

            Overcoming WDDM : Tesla compute-only driver on GeForce

            One big pain with CUDA under Windows Vista or Seven is that performances suffers a lot from limits and overheads imposed by the WDDM (Windows Display Driver Model) the driver has to comply to.
            This means slower kernel launches, limit on the size of memory allocations and a lot  of constraints that prevents NVIDIA to efficiently implement a lot of features in CUDA.

            Tim Murray on the CUDA forum:

            "Welcome to WDDM. Kernel launch overhead is ~3us on non-WDDM platforms. On WDDM, it's 40 at a minimum and can potentially be much larger. Considering the number of kernels you're launching in 10ms, that's going to add up."
            "WDDM is a lot more than just a rendering interface. It manages all the memory on the device so it can page it in and out as necessary, which is a good thing for display cards. However, we get zero benefit from it in CUDA, because we have pointers! As a result, you can't really do paging in a CUDA app, so you get zero benefit from WDDM. However, because it's the memory manager, we can't just go around it for CUDA because WDDM will assume it owns the card completely, start moving memory, and whoops your CUDA app just exploded. So no, there's not really some magic workaround for cards that can also be used as display."

            To overcome this problem, NVIDIA provides a compute-only drivers for Tesla boards. But with little effort it can be also be installed on GeForce.
            How to install them on GeForce:

            New blog

            After many years using my icare3d personal website as a kind of blog, I finally turned out to Blogger. Publishing things on my site involved too much formating time and I became lazy posting there. The intend of this new blog is to publish more regularly my thoughts and findings about GPUs, parallel programming, and computer graphics.
            The first posts are likely to be bunches of old stuffs I did not take time to post earlier.
            Hope you will enjoy it !

            Copyright © Icare3D Blog
            Designed by Templates Next | Converted into Blogger Templates by Theme Craft