Maxwell GM204 OpenGL extensions

NVIDIA just launched the Maxwell 2 architecture with GM204, and this is, I believe, an incredible chip. The Maxwell 2 architecture is both highly energy efficient (~2x the perf/watt of Kepler in games) and provides a lot of very exciting new graphics features (some of them are exposed in Direct3D). These features are exposed in the form of new OpenGL extensions in the R344 driver that was released today, and the specification for all NVIDIA-supported GL extensions can be found here. NVIDIA also released new SDK samples using these extensions.

Quick description of the new extensions


This feature adds a lot of flexibility to multi-sampled rasterization by decoupling the rasterization sampling frequency (which can be set explicitly) from the actual framebuffer storage. It enables rasterization to operate at a higher sampling frequency than the number of samples in the target color render buffers, and it supports both depth and stencil testing at this frequency, provided the corresponding depth and stencil buffers have a matching sample count (which must be a multiple of the number of samples in the color buffers).
There are still some constraints: all color buffers must have the same number of samples, the raster sample count must match the depth and stencil buffer sample count if the depth or stencil test is enabled, and it must be greater than or equal to the color buffer sample count.

A new “coverage reduction stage” is introduced in the per-fragment operations (after the fragment shader in early-z mode, after the depth-test in late-z), which converts a set of covered raster/depth/stencil samples to a set of covered color samples. There is an implementation-dependent association of raster samples to color samples. The reduced "color coverage" is computed such that the coverage bit for each color sample is 1 if any of the associated bits in the fragment's coverage is set, and 0 otherwise. This feature can be used in conjunction with the coverage to color feature (cf. below), in order to get the FS output coverage mask automatically transformed into a color by ROP. According to AnandTech, it seems that when rasterizing with explicit multisampling and no render-target, GM204 allows evaluating primitive coverage at 16x MSAA.
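The reduction rule described above can be sketched on the CPU as a small bitmask operation. This is only an illustration: the function name is made up, and the mapping of raster samples to color samples (consecutive groups) is an assumption, since the real association is implementation-dependent.

```python
def reduce_coverage(raster_mask: int, raster_samples: int, color_samples: int) -> int:
    """Reduce a raster coverage mask to a color coverage mask.

    Assumption: raster sample i is associated with color sample
    i // (raster_samples // color_samples). Real hardware is free to
    use a different association.
    """
    if raster_samples % color_samples != 0:
        raise ValueError("raster sample count must be a multiple of the color sample count")
    per_color = raster_samples // color_samples
    color_mask = 0
    for c in range(color_samples):
        # Bits of the raster mask associated with color sample c.
        group = ((1 << per_color) - 1) << (c * per_color)
        # The color sample is covered if ANY associated raster sample is covered.
        if raster_mask & group:
            color_mask |= 1 << c
    return color_mask
```

For example, with 8 raster samples and 2 color samples, a fragment that only covers raster sample 4 ends up covering color sample 1 under this assumed mapping.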

Note that EXT_raster_multisample is equivalent to "Target-Independent Rasterization" in Direct3D 11.1, which allows using multiple raster samples with a single color sample, as long as depth and stencil tests are disabled, and it is actually a subset of NV_framebuffer_mixed_samples which is more general and exposes more flexibility.

This allows using ROP to automatically convert the post depth-/stencil-/alpha- test coverage mask into a color and write it into a color render target. This is performed before the new coverage reduction stage (cf. NV_framebuffer_mixed_samples). This can be useful for deferred shading.

This extension allows the fragment shader to get the post depth-test coverage mask of the current fragment as input (gl_SampleMaskIn[]) when operating in early-depth mode (in which case only the samples passing the depth test are set), unlike the standard GL 4.5 behavior which provides the pre-depth-test coverage (actual triangle coverage).

The standard GL behavior for FS output coverage mask (gl_SampleMask[]) is to AND it with the actual primitive input coverage mask. This extension disables this operation, which allows the fragment shader to fully override the primitive coverage, potentially setting coverage bits that were not set in the input mask. This is actually very nice, because it allows dynamically routing color output values into arbitrary sample locations inside a multisampled render target.
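The difference between the two behaviors boils down to one bitwise operation. Here is a minimal sketch with coverage masks modeled as integers; the function and parameter names are mine, not part of any API.

```python
def final_coverage(input_coverage: int, fs_sample_mask: int, override: bool) -> int:
    """Coverage used for the color writes of one fragment.

    input_coverage: primitive coverage computed by the rasterizer.
    fs_sample_mask: mask written by the fragment shader (gl_SampleMask[]).
    override: True models the behavior enabled by the extension.
    """
    if override:
        # The shader output fully replaces the primitive coverage.
        return fs_sample_mask
    # Standard GL: the shader can only remove samples, never add them.
    return input_coverage & fs_sample_mask
```

With input coverage 0b0011 and a shader mask of 0b0110, the standard path yields 0b0010, while the override path yields 0b0110 and so writes to a sample the primitive did not cover.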

Allows applications to explicitly set the location of sub-pixel samples for multisample rasterization, providing fully programmable sampling patterns. It seems that the sub-pixel positions are snapped to a 16x16 sub-pixel grid, and sampling patterns can be defined within a grid of 2x2 adjacent pixels.
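To give an idea of what snapping to a 16x16 sub-pixel grid means for a programmable pattern, here is a small sketch. The grid size comes from the observation above; the rounding mode (truncation toward the lower grid point) and the function name are assumptions for illustration only.

```python
import math
from typing import Tuple


def snap_sample_location(x: float, y: float, grid: int = 16) -> Tuple[float, float]:
    """Snap a sample location in [0, 1) to a grid x grid sub-pixel lattice.

    Assumption: locations are truncated to the lower grid point; actual
    hardware may round differently.
    """
    def snap(v: float) -> float:
        if not 0.0 <= v < 1.0:
            raise ValueError("sample location must lie in [0, 1)")
        return math.floor(v * grid) / grid

    return snap(x), snap(y)
```

A requested location of (0.53, 0.99) would land on (0.5, 0.9375) under these assumptions, so two requested positions closer than 1/16 of a pixel can collapse onto the same sample.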


This is a really great feature. It allows rasterization to generate fragments for any pixel touched by a triangle, even if no sample location is covered within the pixel. A new control is also provided to modify the window coordinate snapping precision, so that the application can match conservative rasterization triangle snapping with the snapping that would have occurred at a higher resolution. Polygons with zero area generate no fragments. Any location within a pixel may be used for interpolating attributes, potentially causing attribute extrapolation if it lies outside the triangle. This can be useful for binning purposes, for instance (one pixel per tile).
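The difference between regular and conservative coverage can be sketched on the CPU with two tests: "does the triangle contain the pixel center?" versus "does the triangle touch the pixel square at all?". This is a simplified model (a separating-axis test on a unit pixel square, no snapping precision control), not how the hardware implements it, and the function names are hypothetical.

```python
def covers_center(tri, px, py):
    """Regular single-sample rule: is the pixel center inside the triangle?"""
    cx, cy = px + 0.5, py + 0.5
    signs = []
    for i in range(3):
        ax, ay = tri[i]
        bx, by = tri[(i + 1) % 3]
        # Edge function: which side of edge (a, b) the center lies on.
        signs.append((bx - ax) * (cy - ay) - (by - ay) * (cx - ax))
    return all(s >= 0 for s in signs) or all(s <= 0 for s in signs)


def touches_pixel(tri, px, py):
    """Conservative rule: does the triangle overlap the pixel square?"""
    (ax, ay), (bx, by), (cx, cy) = tri
    if (bx - ax) * (cy - ay) - (by - ay) * (cx - ax) == 0:
        return False  # zero-area polygons generate no fragments
    xs, ys = [ax, bx, cx], [ay, by, cy]
    # Separating axes 1 and 2: the pixel square's own x and y axes.
    if max(xs) < px or min(xs) > px + 1 or max(ys) < py or min(ys) > py + 1:
        return False
    corners = [(px, py), (px + 1, py), (px, py + 1), (px + 1, py + 1)]
    # Separating axes 3 to 5: the normals of the three triangle edges.
    for i in range(3):
        x0, y0 = tri[i]
        x1, y1 = tri[(i + 1) % 3]
        nx, ny = y0 - y1, x1 - x0
        tri_proj = [nx * x + ny * y for x, y in tri]
        box_proj = [nx * x + ny * y for x, y in corners]
        if max(tri_proj) < min(box_proj) or min(tri_proj) > max(box_proj):
            return False
    return True
```

A tiny triangle sitting in a corner of pixel (0, 0) is missed by covers_center() but reported by touches_pixel(), which is exactly what a binning pass needs.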

This extension exposes a hardware-accelerated critical section for the fragment shader, allowing hazard-free read-modify-write operations on a per-pixel basis. It also allows enforcing primitive ordering for threads entering the critical section. It provides new GLSL functions, beginInvocationInterlockNV() and endInvocationInterlockNV(), that define a critical section which is guaranteed to be executed for only one fragment at a time. The interlock can be per-pixel, or per-sample if multi-sampled rasterization is used. This is useful for algorithms that need to access per-pixel data structures via shader load and store operations while avoiding race conditions. Obvious applications are OIT and programmable blending, for instance.
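As a rough CPU analogy of what the interlock guarantees, here is a sketch where each pixel has its own lock and "fragments" are threads doing a read-modify-write on the same pixel. The class and function names are made up for illustration, and the sketch does not model primitive ordering, only mutual exclusion.

```python
import threading


class InterlockedPixels:
    """Per-pixel storage where each pixel has its own critical section."""

    def __init__(self, pixel_count):
        self.values = [0] * pixel_count
        self._locks = [threading.Lock() for _ in range(pixel_count)]

    def accumulate(self, pixel, amount):
        # Entering the lock plays the role of beginInvocationInterlockNV(),
        # leaving it the role of endInvocationInterlockNV().
        with self._locks[pixel]:
            current = self.values[pixel]            # load
            self.values[pixel] = current + amount   # modify + store


def run_fragments(pixels, pixel, fragment_count, adds_per_fragment):
    """Run several 'fragments' (threads) that all touch the same pixel."""
    def fragment():
        for _ in range(adds_per_fragment):
            pixels.accumulate(pixel, 1)

    threads = [threading.Thread(target=fragment) for _ in range(fragment_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return pixels.values[pixel]
```

Because the load and the store happen inside the same critical section, no update is lost, whatever the thread interleaving. Fragments touching different pixels never wait on each other.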

This allows rasterizing the axis-aligned screen-space bounding box of submitted triangles, disregarding the actual triangle edges. This can be useful for drawing a full-screen quad without an internal edge for instance, or for more efficiently drawing user interfaces.
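Here is a small sketch of which pixels get covered when a triangle is replaced by its screen-space bounding box. It assumes one sample at each pixel center and an inclusive edge rule; both are simplifications of mine, and the function name is hypothetical.

```python
def bounding_box_pixels(triangle, width, height):
    """Pixels whose center falls inside the triangle's axis-aligned bounding box.

    triangle: three (x, y) window-space vertices.
    Assumption: one sample at the pixel center, box edges inclusive.
    """
    xs = [x for x, _ in triangle]
    ys = [y for _, y in triangle]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    covered = []
    for py in range(height):
        for px in range(width):
            cx, cy = px + 0.5, py + 0.5
            # The triangle edges are ignored; only the box matters.
            if min_x <= cx <= max_x and min_y <= cy <= max_y:
                covered.append((px, py))
    return covered
```

A single triangle with vertices (0, 0), (4, 0) and (0, 4) fills the whole 4x4 target this way, which is the "full-screen quad without an internal edge" case.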

Geometry processing

This extension makes it possible to write more efficient geometry shaders when there is a one-to-one mapping between input and output primitives, per-vertex attributes are simply copied from the input primitive to the corresponding vertices of the output primitive, and the geometry shader is only used to set per-primitive attributes (like gl_Layer ... ).


This extension improves on ARB_sparse_texture, which separates the allocation of virtual address space from physical memory for textures, and provides the ability to sparsely allocate the physical backing store of 2D/3D/2DArray textures on a per-tile basis. This new extension adds the ability to retrieve texture access residency information from GLSL, to specify a minimum allocated LOD for texture fetches, and to return a constant zero value for lookups into unallocated pages. It also adds support for multi-sampled textures.
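To make the residency idea concrete, here is a toy CPU model of a sparse 2D texture: pages are committed individually, and a fetch returns both a value and a residency flag, with zero for unallocated pages. The class and its methods are entirely hypothetical and only mimic the behavior described above.

```python
class SparseTexture:
    """Toy sparse 2D texture: virtual extent with per-page physical storage."""

    def __init__(self, width, height, page_size):
        self.width = width
        self.height = height
        self.page_size = page_size
        self.pages = {}  # (page_x, page_y) -> 2D list of texels

    def commit(self, page_x, page_y, fill=0.0):
        """Allocate physical storage for one page."""
        ps = self.page_size
        self.pages[(page_x, page_y)] = [[fill] * ps for _ in range(ps)]

    def fetch(self, x, y):
        """Return (value, resident). Unallocated pages read as zero."""
        ps = self.page_size
        page = self.pages.get((x // ps, y // ps))
        if page is None:
            return 0.0, False
        return page[y % ps][x % ps], True
```

A shader-side algorithm can use the residency flag to fall back to a coarser level or to request that the missing page be streamed in.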

Exposes a new sampler parameter that performs a min or max reduction on the values sampled inside a texture filtering footprint, instead of the regular linear interpolation.
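The following sketch contrasts the regular bilinear result with a min or max reduction over the same 2x2 footprint. The function name and the reduction strings are made up, and details such as how zero-weight texels are treated are deliberately ignored.

```python
def sample_2x2(t00, t10, t01, t11, fx, fy, reduction="weighted_average"):
    """Filter a 2x2 texel footprint.

    t00/t10 are the top-row texels, t01/t11 the bottom-row texels;
    fx and fy are the fractional sample position inside the footprint.
    """
    if reduction == "min":
        return min(t00, t10, t01, t11)
    if reduction == "max":
        return max(t00, t10, t01, t11)
    # Regular bilinear interpolation.
    top = t00 * (1.0 - fx) + t10 * fx
    bottom = t01 * (1.0 - fx) + t11 * fx
    return top * (1.0 - fy) + bottom * fy
```

With texels 0, 1, 2 and 3 sampled at the footprint center, the bilinear result is 1.5, while the min and max reductions return 0 and 3. This is handy for building or querying things like min/max depth pyramids.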


Unreal Engine 4 : real-time GI using voxel cone-tracing !

Epic Games just released two impressive videos presenting their new UE4 game engine featuring real-time global illumination that supports diffuse and specular bounces on dynamic scenes.

According to this interview of Tim Sweeney, their technique is called SVOGI (Sparse Voxel Octree Global Illumination) and... it is based on the voxel cone-tracing technique we published last year at Pacific Graphics ! :-D

That's really awesome to see research results being used so quickly in a game engine ! Beyond the paper, you can find more details about our voxel cone-tracing approach in my GTC 2012 talk, my Siggraph talk, as well as in my Ph.D thesis on GigaVoxels.

GTC 2012 Talk: "Octree-Based Sparse Voxelization for Real-Time Global Illumination"

This week I gave a talk at the NVIDIA GPU Technology Conference about a fast sparse voxelization technique that I worked on at NVIDIA in the context of my real-time global illumination approach (using voxel cone-tracing inside a sparse voxel octree).

Video (if you want to enjoy my French accent):

In this talk I first give an overview of the real-time GI approach, before quickly detailing the new GPU voxelization and octree construction algorithm.

This sparse voxelization technique will be published in the OpenGL Insights book that will be out for Siggraph.

[UPDATE 07/12] The book chapter in OpenGL Insights has been published online here.

Ph.D thesis: GigaVoxels

I defended my Ph.D thesis on GigaVoxels last July, and the document is now online.

You can download it there:
GigaVoxels: A Voxel-Based Rendering Pipeline For Efficient Exploration Of Large And Detailed Scenes

You can also check my other publications on my Ph.D webpage.

Interview on 3DVF

I gave an interview to 3DVF, a great French community website dedicated to image synthesis, and numerical creation in general. The interview is mainly centered around our voxel-based GI approach, as well as my former INRIA research team and computer graphics research in France.

The interview can be read here (in French).

Ph.D defended and postdoc at NVIDIA Research

I did not take the time to report on this before, but I defended my Ph.D thesis on GigaVoxels at INRIA this summer (my thesis will soon be available there), and I started a one-year postdoc at NVIDIA Research in the SF bay area. There, I continue working on voxel representations, of course :-D

Even if I am now an NVIDIA employee, this continues to be a personal blog. All the things I publish here are totally mine, and my opinions do not necessarily reflect those of my employer !

Real-Time Shadows Book

Just a quick post to recommend the great book "Real-time shadows", which details, explores and clearly explains a large number of relevant techniques for real-time shadow rendering, which we all know must be one of the most prolific fields in computer graphics !
In addition, this book is written by 4 leading experts of the field: Elmar Eisemann, Michael Schwarz, Ulf Assarsson and Michael Wimmer.
For me, it is now THE reference on shadow rendering !

More info on the book website, and a good description on the publisher's website. Of course it can be bought on Amazon.

Siggraph 2011 Slides

Here is a quick list of interesting Siggraph 2011 courses and talks available online.

Courses/Talks :
Advances in Real-Time Rendering in Games
Beyond Programmable Shading
State-of-the-Art:  Interactive Global Illumination
Destruction and Dynamics for Film and Game Production
Filtering Approaches for Real-Time Anti-Aliasing
Production Volume Rendering
Compiler Technology for Rendering
Liquid Simulation with mesh-based Surface Tracking

NVIDIA Slides , Videos

OpenGL/OpenCL :
OpenGL BOF Slides
Introduction to Modern OpenGL Programming
OpenCL BOF Slides

HPG 2011

Also a great list with more posters and talks is available on Stephen Hill's Blog
There is also my talk on Interactive Indirect Illumination Using Voxel Cone Tracing :-)

OpenGL 4.2 specification released

Specifications can be found in the GL registry:
NVIDIA drivers already supporting it as well as new ARB extensions can be found there:

A full and very good review can be found on G-Truc Creation:
Among the interesting things: shader_atomic_counters, conservative depth, instanced transform feedback, the integration of shader_image_load_store and more !

And congrats to Christophe for his first credit in the spec :-D

"Interactive Indirect Illumination Using Voxel Cone Tracing" paper accepted at Pacific Graphics 2011

I am glad to announce that our paper "Interactive Indirect Illumination Using Voxel Cone Tracing" (cf. my previous post) has been accepted at Pacific Graphics 2011 !

You can find the authors' version of the paper on my research page :
On my INRIA webpage:

Also, don't forget to attend my talk if you are going to Siggraph 2011 !
Tuesday, 9 August 9:00 am - 10:30 am | West Building, Rooms 109/110

Free 3D meshes links

I started to compile links to websites where free 3D models can be found. If you know other good websites, feel free to post them in the comments :-)

Static models and scenes:

Animated models and scenes:

    Interactive Indirect Illumination and Ambient Occlusion Using Voxel Cone Tracing

    I am happy to present a new approach to compute indirect illumination and ambient occlusion in real-time, using a pre-filtered geometry representation stored inside a sparse voxel octree as well as a new approximate voxel-based cone-tracing. The sparse voxel octree is used as a proxy for occlusion information and indirect illumination. It supports dynamic objects and environments thanks to a new, very fast voxelization algorithm that updates the octree structure and pre-filters geometry and lighting information.

    This work has been done in collaboration with Miguel Sainz and Simon Green from  NVIDIA, as well as Fabrice Neyret from CNRS and Elmar Eisemann from ParisTech.

    The paper has been published at Pacific Graphics. You can find the authors' version of it on my research page:

    I also presented this work as a talk at Siggraph 2011 in Vancouver as well as a poster at I3D 2011.

    [Update 30/06/2011]: Benoit Rogez wrote a very good article about our approach on 3DVF (in French) : here
    [Update 10/06/2012]: More details on the algorithm and the fast voxelization and octree construction can be found in my GTC 2012 presentation.

    Indirect lighting (global illumination):
    High quality video:

    Ambient occlusion:
    High quality video:

    The vicious circle of generalization @AltDevBlogADay

    Full story here :-D

    This can be described as a vicious circle of generalization:
    1. We want the code to be as generalized as possible.
    2. We design everything future-proof and extendible.
    3. When a feature request arrives, we’re doomed: we need to change a lot of code.
    4. Why?
    5. Because everything was designed as generalized as possible.
    6. goto 1;

    GDC 2011 Technical Presentations

    GDC 2011 is over now and presentations are starting to appear online :-)

    Khronos on OpenGL:

    More links can be found on this blog:

    Season of next-gen game engines

    This is the season of the new game engines : Dice Frostbite 2, Crytek CryEngine 3 and Epic Unreal Engine 3 !

    CUDA 4.0 RC Released to Registered Developers

    Here it is, CUDA 4.0 RC just got released to NVIDIA registered developers.

    Interesting stuff from the CUDA manual:

    • Layered Textures Support (GL_TEXTURE_1D/2D_ARRAY): New tex.a1d/.a2d modifiers in PTX. But unfortunately the surface instructions do not support them yet, Grrrr
      Layered textures are created using cudaMalloc3DArray() with the cudaArrayLayered flag. New cudaTextureType1DLayered/cudaTextureType2DLayered texture sampler types and tex1DLayered()/tex2DLayered() access intrinsics.
    • New .address_size PTX specifier: Specifies the address size (32b/64b) used throughout a PTX module.
    • Inline PTX assembly: This feature was already present since CUDA 2.x but was not officially supported. It's now fully supported and documented :-D
    • Driver API, new thread-safe stateless launch API function cuLaunchKernel(): cuLaunchKernel(kernelObj,   blocksPerGrid, 1, 1,   threadsPerBlock, 1, 1,   0, 0, args, 0);
    • FERMI ISA documented and supported by cuobjdump.
    • Enhanced C++: Support for operators new and delete, virtual functions.

    CUDA 4.0 announced @GDC 2011

    Today NVIDIA announced CUDA 4.0 at the GDC. It will be available to registered developers on March 4th :-D
    Among the interesting novelties, there is support for layered textures (GL_TEXTURE_2D_ARRAY), which I hope will also be supported for surface access !
    There is also support for direct peer-to-peer communication between GPUs, and for mapping the memory of multiple GPUs (and potentially other third-party devices like network/InfiniBand adapters) into the same address space in order to provide direct memory access (Unified Virtual Addressing, UVA). Virtual functions should also now be supported, along with the new and delete operators for dynamic memory allocations from kernels.
    Looking forward to testing all of this !

    More info : Anandtech, NVIDIA pressroom, Dr Dobbs

    Update: CUDA 4.0 RC released to registered developers
    Slides are available there:
    Among the interesting novelties I did not see before, it seems inline PTX will be officially supported with this release ! Also, the disassembler (cuobjdump), which was previously limited to the Tesla ISA, now supports Fermi ISA disassembly. Take a look at the manual for the list of supported instructions.

    Writing a PhD thesis using FreeMind and LaTeX : My FreeMind LaTeX import plug-in

    I am currently in the process of writing my PhD thesis, and I discovered that relying on mind mapping software can in fact prove very useful, especially at the beginning of the process. Indeed, I found mind mapping tools like FreeMind very useful to graphically organize ideas into a hierarchy, which lets you quickly and very efficiently construct and update the outline of your thesis.
    It is actually easy to export a mind map from FreeMind into a hierarchy of LaTeX sections, in order to directly get your document structure ready to fill.
    I am actually using this XSLT definition to do so.

    The only problem when working like this is that once you have exported your document structure from FreeMind into LaTeX, made modifications and written some sections in your LaTeX file, there is no way to import it back into FreeMind in order to reorganize your document.

    To allow this usage, I wrote a small Groovy script for FreeMind that reads a LaTeX file and builds the tree hierarchy out of it. It also imports the content of the sections into the text attribute of the nodes, and this content gets correctly exported by the LaTeX export XSLT.
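To illustrate the idea of rebuilding a hierarchy from a LaTeX file, here is a minimal Python sketch. It is not the Groovy plug-in itself and does not touch the FreeMind API: it only shows how sectioning commands can be turned into a tree of nodes carrying their text. The node layout (plain dictionaries) and the function name are assumptions for the example.

```python
import re

# Nesting depth of each supported sectioning command.
LEVELS = {"chapter": 0, "section": 1, "subsection": 2, "subsubsection": 3}
HEADING = re.compile(r"\\(chapter|section|subsection|subsubsection)\*?\{([^}]*)\}")


def parse_latex_outline(source):
    """Build a tree of {"title", "level", "text", "children"} nodes."""
    root = {"title": "root", "level": -1, "text": "", "children": []}
    stack = [root]  # path from the root to the node currently being filled
    pos = 0
    for match in HEADING.finditer(source):
        # Text seen since the previous heading belongs to the current node.
        stack[-1]["text"] += source[pos:match.start()]
        pos = match.end()
        level = LEVELS[match.group(1)]
        # Climb back up until we find a node that can be the parent.
        while stack[-1]["level"] >= level:
            stack.pop()
        node = {"title": match.group(2), "level": level, "text": "", "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    stack[-1]["text"] += source[pos:]
    _strip_text(root)
    return root


def _strip_text(node):
    """Trim surrounding whitespace from every node's text."""
    node["text"] = node["text"].strip()
    for child in node["children"]:
        _strip_text(child)
```

Each node of the resulting tree would map to one mind-map node, with the "text" field playing the role of the node's text attribute. Headings with nested braces in their titles are not handled by this simple regular expression.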

    This script must be put into the FreeMind/plugins/script directory, and the following section must be added to the FreeMind/plugins/ScriptingEngine.xml file in order to make the script visible in the Tools menu:

    Of course, this code is provided with no warranty, but it should work well :-) 
    Have fun !


    Today, I would like to invite you to follow an awesome new collaborative game development blog called #AltDevBlogADay. The blog was founded in January by Mike Acton, engine director at Insomniac Games, and brings together many very good game developers who publish one article each day on the website.

    Here is a short list of the articles I particularly liked and that I advise you to take a look at, among those already published:

    January pop stack post !

    I have not been posting here for a while and I am sorry about this. I have been quite busy lately with a paper submission and also the writing of my PhD thesis ! Anyway, I wish everybody a happy new year :-)
    There are a couple of links I stacked lately and would like to share, so here they are !

    • NVIDIA Announced Project Denver : A fully integrated CPU+GPU chip dedicated to high-performance computing and based on ARM cores. It is designed to power future products ranging from personal computers to servers and supercomputers. It looks like NVIDIA's answer to AMD Fusion and is also clearly a "we don't believe in the future of x86 architectures" message sent to the industry. I really believe this is the right approach to bring high performance CPUs to all market segments. Thanks to the rise of mobile and tablet-like devices, we may finally see the end of the hegemony of the x86 architecture. I really hope that it will not be the last dinosaur from NVIDIA ;-)
    • gDebugger is now free !
    • “Principles of Digital Image Synthesis” book is now free for download. And a nicely formatted PDF can be found here.
    • Data-Oriented Design - Links and Thoughts : Data-Oriented Design (DOD), as opposed to Object Oriented Programming (OOP), is a concept that has been emerging and gaining popularity in the game developer community lately. The idea is that instead of building performance-sensitive parts of applications around classes that represent individual objects with inefficient (non-cache-coherent) memory access patterns, it is more efficient to build them by thinking in terms of data flows and memory access patterns. That's a concept that may not be unfamiliar to GPU developers :-D
    • Overtime vs Productivity : Two very interesting blog posts by Allan Mckay that discuss how people work, why we often get burned out and how to avoid this.
    • A lot of very interesting technical documentation about NVIDIA GPUs, coming from the reverse engineering of various hardware, can be found in the pscnv Git repository. Pscnv is a fork of the nouveau project, an open source driver for NVIDIA GPUs.
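The Data-Oriented Design item above is easier to picture with a tiny layout comparison. The sketch below updates particle positions once with an "array of objects" layout and once with a "structure of arrays" layout. All names are hypothetical, and Python will not show the cache effects; the point is only the difference in how the data is organized.

```python
from array import array


class Particle:
    """Object-oriented layout: one object per particle."""

    def __init__(self, x, vx):
        self.x = x
        self.vx = vx


def step_objects(particles, dt):
    # Each iteration jumps to a different object in memory.
    for p in particles:
        p.x += p.vx * dt


class ParticleArrays:
    """Data-oriented layout: one contiguous array per attribute."""

    def __init__(self, xs, vxs):
        self.x = array("d", xs)
        self.vx = array("d", vxs)


def step_arrays(particles, dt):
    # The update streams linearly through two contiguous arrays.
    x, vx = particles.x, particles.vx
    for i in range(len(x)):
        x[i] += vx[i] * dt
```

Both versions compute the same result; in a compiled language the second layout is the one that maps naturally onto SIMD units and coalesced GPU memory accesses.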

    "How the GPU works" @c0de517e

    I rediscovered a very good in-depth explanation of how the GPU works, published in 2008 on the c0de517e blog:
    Part 1 Part 2 Part 3

    CUDA "Better Performance at Lower Occupancy" @GTC2010

    A friend pointed me to this very interesting talk at NVIDIA GTC:
    Better Performance at Lower Occupancy

    The talk debunks two common fallacies that CUDA developers usually believe in:

    • Multithreading is the only way to hide latency on GPU
    • Shared memory is as fast as registers

    All the GTC2010 presentations can be found there (with slides and videos !):

    Old Real-Time GPU Raytracer

    I just translated from French to English an old page on my website about a real-time GPU raytracer I developed for fun 4 years ago, during my Master's thesis. It is old school GPGPU in OpenGL and Cg that can run on an NV40 (GeForce 6800). No need for CUDA or a GF110 to do GPU raytracing ! ;-)
    The application also features a slow and unoptimized CPU raytracer.

    See there:

    PS: It is funny to see what was possible at this time, but it was developed quickly and the shader code itself is not a reference !

    Fluid Simulation for Video Games @INTEL

    There is a very interesting series of articles about fluid simulation for video games written by Michael J. Gourlay on the Intel developer website. Source code is also provided.
    Parts: 1, 2, 3, 4, 5, 6, 7, 8

    CUDA 3.2 Final released

    Download it on NVIDIA Developer website !

    NVIDIA Fermi GPU and Architecture Analysis @Beyond3D

    The article is 3 weeks old but I just read it. Beyond3D published a very good analysis of the Fermi architecture. It is based on many homemade tests they developed to benchmark individual parts of the GF100 chip. Based on these analyses, they made interesting discoveries and speculations about the GF100 architecture.

    In this article, I also discovered "Pomegranate", a parallel hardware architecture for polygon rendering developed at Stanford that seems to be very close to the way Fermi handles parallel work distribution across the different steps of the graphics pipeline. Pomegranate [Eldridge et al., 2000]

    Discussions are on Beyond3D Forum.

    Here are some interesting statements:


    First reviews of the NVIDIA GF110: GTX580

    The GF110 is the new high-end GPU from NVIDIA, based on a renewed Fermi architecture. Even though the chip has not been officially launched, reviews are already starting to appear online !

    In French:

    It seems reviews are pretty good !
    To sum up: full speed FP16 texture filtering, improved Z-cull performance, architectural tweaks, 15-20% performance improvements over the GTX480 in games, lower power consumption, quieter, cooler.
    According to techreport, an interesting subtle change is that the 16/48KB local storage partition can be configured by the driver for graphics contexts, while it was only configurable in compute on the GF100.

    Congrats NVIDIA :-)

    Texture and buffer access performance on Evergreen architecture

    A very interesting article about texture and buffer access performance in OpenGL on the AMD Evergreen architecture:

    Various stuff from October

    I did not have much time to update this blog lately, so here is some interesting stuff I did not post during October:

    OpenGL SuperBible Fifth Edition

    Last week, Addison Wesley kindly sent me a copy of the Fifth Edition of the OpenGL SuperBible so that I can write a review of it. So let's do that :-)

    The OpenGL SuperBible has been a reference book since its first release, and this fifth edition is the first to be exclusively focused on modern, shader-based OpenGL programming. That's the great novelty of this edition: it is based on the OpenGL 3.3 API, and all discussions about deprecated fixed-function programming have been thrown out of the book.
