NVIDIA Turing Vulkan/OpenGL extensions

NVIDIA just launched the TU102/TU104 (GeForce RTX 2080 Ti/2080), the first GPUs based on the Turing architecture. This new architecture brings hardware ray-tracing acceleration, as well as many other new and really cool graphics features. A good high-level overview of the architecture and its new graphics features can be found in the Turing Architecture whitepaper as well as in this blog post. Most of these features are exposed through both Vulkan and OpenGL extensions, and I will quickly go through each of them in this post. A big thanks to the many people at NVIDIA who worked hard to provide us with these extensions!

Most features are split into a Vulkan- or OpenGL-specific API extension (VK_*/GL_*) and a GLSL or SPIR-V shader extension (GLSL_*/SPV_*).

    Ray-Tracing Acceleration

    Turing brings hardware acceleration for ray-tracing through dedicated units called RT cores. The RT cores provide BVH traversal as well as ray-triangle intersection. This acceleration is exposed in Vulkan through a new ray-tracing pipeline, associated with a series of new shader stages. This programming model maps to the DXR (DirectX Raytracing) model, which is quickly described in this blog post, while this blog post details the Vulkan implementation.

    A GTC 2018 presentation about Vulkan Ray-Tracing can also be found here: http://on-demand.gputechconf.com/gtc/2018/video/S8521/ (slides here).
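    As a rough illustration of the model, a minimal GLSL ray generation shader under the NV ray-tracing extension could look like the sketch below. The descriptor bindings, camera setup and payload layout are all hypothetical:

```glsl
#version 460
#extension GL_NV_ray_tracing : require

layout(binding = 0) uniform accelerationStructureNV topLevelAS;
layout(binding = 1, rgba8) uniform image2D outputImage;

// Payload written by the closest-hit / miss shaders (location is an assumption)
layout(location = 0) rayPayloadNV vec3 hitColor;

void main()
{
    // One ray per pixel of the launch grid
    vec2 uv     = (vec2(gl_LaunchIDNV.xy) + 0.5) / vec2(gl_LaunchSizeNV.xy);
    vec3 origin = vec3(0.0, 0.0, -2.0);                  // hypothetical camera
    vec3 dir    = normalize(vec3(uv * 2.0 - 1.0, 1.0));

    traceNV(topLevelAS, gl_RayFlagsOpaqueNV, 0xFF,
            0 /*sbtOffset*/, 0 /*sbtStride*/, 0 /*missIndex*/,
            origin, 0.001, dir, 1000.0, 0 /*payload location*/);

    imageStore(outputImage, ivec2(gl_LaunchIDNV.xy), vec4(hitColor, 1.0));
}
```

    The traversal and intersection themselves run on the RT cores; the shader stages only decide which rays to trace and what to do with the results.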

    Mesh Shading

    This is a new programmable geometry pipeline which replaces the traditional VS/HS/DS/GS pipeline with what is essentially a compute-based programming model. The new pipeline consists of two shader stages, a Task Shader and a Mesh Shader (separated by an expansion stage), which ultimately generate a compact mesh description called a meshlet. A meshlet is a mini indexed geometry representation which is kept on chip and fed directly to the rasterizer for consumption. This exposes a very flexible and very efficient model with Compute Shader features and generic cooperative thread groups (workgroups, shared memory, barrier synchronization...). Applications are endless: this can for instance be used to implement efficient culling or LOD schemes, or to perform procedural geometry generation.

    Many details can be found in this excellent blog post by Christoph Kubisch: https://devblogs.nvidia.com/introduction-turing-mesh-shaders/
    As well as in his Siggraph 2018 presentation: http://on-demand.gputechconf.com/siggraph/2018/video/sig1811-3-christoph-kubisch-mesh-shaders.html

    A full OpenGL sample implementing a compute-based adaptive tessellation technique can also be found here: https://github.com/jdupuy/opengl-framework/tree/master/demo-isubd-terrain
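    To give an idea of the programming model, here is a minimal GLSL mesh shader sketch based on GL_NV_mesh_shader. The workgroup sizing and the hard-coded triangle are made up purely for illustration:

```glsl
#version 450
#extension GL_NV_mesh_shader : require

layout(local_size_x = 32) in;
layout(triangles, max_vertices = 64, max_primitives = 126) out;

void main()
{
    // Each invocation of the workgroup can write one or more vertices and
    // primitives of the meshlet; here a single triangle is emitted by
    // invocation 0 to keep the sketch short.
    if (gl_LocalInvocationID.x == 0) {
        gl_MeshVerticesNV[0].gl_Position = vec4(-1.0, -1.0, 0.0, 1.0);
        gl_MeshVerticesNV[1].gl_Position = vec4( 1.0, -1.0, 0.0, 1.0);
        gl_MeshVerticesNV[2].gl_Position = vec4( 0.0,  1.0, 0.0, 1.0);

        gl_PrimitiveIndicesNV[0] = 0;
        gl_PrimitiveIndicesNV[1] = 1;
        gl_PrimitiveIndicesNV[2] = 2;
        gl_PrimitiveCountNV = 1;       // one triangle in this meshlet
    }
}
```

    In a real use case the vertices would typically be cooperatively fetched and transformed by the whole workgroup, with shared memory and barriers available as in a compute shader.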

    Variable Rate Shading

    This is a very powerful hardware feature which allows the application to dynamically control the number of fragment shader invocations (independently of the visibility rate) and vary this shading rate across the framebuffer. The shading rate is controlled using a texture image ("Shading Rate Image", 8b/texel) where each texel specifies an independent shading rate for blocks of 16x16 pixels. The rate is actually specified indirectly using 8b indices into a palette which is specified per-viewport and stores the actual shading rate flags.

    Not only does this feature allow varying the MSAA shading rate per pixel (allowing 1x, 4x, 8x, and now even 16x SSAA, but with a maximum of 8x depth test and color storage), it also allows dropping the shading rate below one invocation per pixel, down to one invocation per block of 4x4 pixels (through 1x2, 2x1, 2x2, 2x4 and 4x2 blocks) and even zero invocations.

    The GLSL extension also exposes intrinsics allowing fragment shaders to read the effective fragment size in pixels (gl_FragmentSizeNV) as well as the number of fragment shader invocations for a fully covered pixel (gl_InvocationsPerPixelNV). This opens the road to many new algorithms and to more efficient implementations of optimized shading-rate techniques, like foveated rendering, lens adaptation (for VR), and content- or motion-adaptive shading.
    More info on Variable Rate Shading in this blog post: https://devblogs.nvidia.com/turing-variable-rate-shading-vrworks/
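    For illustration, a fragment shader could read these intrinsics along the following lines; the debug visualization itself is made up:

```glsl
#version 450
#extension GL_NV_shading_rate_image : require

layout(location = 0) out vec4 fragColor;

void main()
{
    // gl_FragmentSizeNV: size in pixels of the area covered by this invocation:
    // (1,1) = one invocation per pixel, up to (4,4) = one per 4x4 block.
    vec2 invRate = 1.0 / vec2(gl_FragmentSizeNV);

    // gl_InvocationsPerPixelNV: number of invocations for a fully covered pixel.
    float superSampling = float(gl_InvocationsPerPixelNV);

    // Hypothetical debug view: tint the output by the local shading rate.
    fragColor = vec4(invRate, superSampling / 16.0, 1.0);
}
```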

    Exclusive Scissor Test

    This adds a second per-viewport scissor test, which culls fragments *inside* (exclusive) the specified rectangle, unlike the standard scissor test which culls *outside* (inclusive). This can be used for instance to implement more efficient multi-resolution foveated-rendering techniques (in conjunction with Variable Rate Shading), where raster passes fill concentric strips of pixels by enabling both inclusive and exclusive scissor tests.

    Texture Access Footprint

    These extensions expose a set of GLSL (and SPIR-V) query functions which report the texture-space footprint of texture lookups, i.e. data identifying the set of all texels that may be accessed in order to produce the filtered result for the corresponding texture access (which may use anisotropic filtering and potentially cover a large footprint). A footprint is returned as an LOD value, an anchor point, and a 64-bit bitfield where each bit represents the coverage of a group of neighboring texels (in 2D, group granularity can range from 2x2 to 256x256 texels).

    This is actually an important component for implementing multi-pass decoupled and texture-space shading pipelines, where a restricted set of actually visible pixels must be determined in order to efficiently perform shading in a subsequent pass.
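    As a sketch, a footprint query in GLSL could look like this. The 128-texel granularity and the way the result is consumed are arbitrary choices for illustration:

```glsl
#version 450
#extension GL_NV_shader_texture_footprint : require

layout(binding = 0) uniform sampler2D tex;
layout(location = 0) in vec2 uv;
layout(location = 0) out vec4 fragColor;

void main()
{
    gl_TextureFootprint2DNV footprint;
    // Query the footprint a regular texture() lookup at uv would touch,
    // using 128-texel groups; returns true if a single LOD is accessed.
    bool singleLod = textureFootprintNV(tex, uv, 128, false, footprint);

    // footprint.anchor/.offset locate the 8x8 grid of groups, footprint.mask
    // is the 64-bit coverage bitfield (as a uvec2), footprint.lod the LOD.
    uint coveredGroups = bitCount(footprint.mask.x) + bitCount(footprint.mask.y);

    // Hypothetical use: visualize the footprint size instead of shading.
    fragColor = vec4(float(coveredGroups) / 64.0, float(singleLod), 0.0, 1.0);
}
```

    In a texture-space shading pipeline, the recorded footprints would instead be used to mark the texels that actually need to be shaded in a later pass.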

    Derivatives in Compute Shader

    These extensions bring Compute even closer to Graphics by adding support for Quad-based derivatives in Compute Shaders, using the x and y coordinates of the local workgroup invocation ID. This allows Compute Shaders to use both built-in derivative functions like dFdx(), as well as texture lookup functions using automatic LOD computation, and the texture level of detail query function (textureQueryLod()).
    Two layout qualifiers are provided to specify the quad arrangement, based either on a linear index or on 2D indices within the workgroup.
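    A minimal compute shader sketch using GL_NV_compute_shader_derivatives; the bindings and image usage are hypothetical:

```glsl
#version 450
#extension GL_NV_compute_shader_derivatives : require

// Arrange invocations into 2x2 quads based on the x/y of the local
// invocation ID, so derivatives and automatic LOD are well defined.
layout(derivative_group_quadsNV) in;
layout(local_size_x = 8, local_size_y = 8) in;

layout(binding = 0) uniform sampler2D tex;
layout(binding = 0, rgba8) uniform writeonly image2D dst;

void main()
{
    vec2 uv = (vec2(gl_GlobalInvocationID.xy) + 0.5) / vec2(imageSize(dst));

    // texture() with implicit LOD selection now works in compute, as do
    // dFdx()/dFdy() and textureQueryLod().
    vec4 color = texture(tex, uv);

    imageStore(dst, ivec2(gl_GlobalInvocationID.xy), color);
}
```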

    Shader Subgroup Operations

    These shader extensions provide a series of ballot-based partitioning and scan/reduce operations which operate on "subgroups" of shader invocations. They can be used, for instance, to implement clustering and de-duplication operations on sets of values distributed among different shader invocations.
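    A small sketch with GL_NV_shader_subgroup_partitioned, summing values per cluster of invocations that share the same key; the key/value buffers are a made-up example:

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_NV_shader_subgroup_partitioned : require

layout(local_size_x = 32) in;

layout(binding = 0) buffer Keys   { uint  keys[];   };
layout(binding = 1) buffer Values { float values[]; };
layout(binding = 2) buffer Sums   { float sums[];   };

void main()
{
    uint  i     = gl_GlobalInvocationID.x;
    uint  key   = keys[i];
    float value = values[i];

    // Partition the subgroup: invocations with equal keys end up in the
    // same partition, described by a ballot mask.
    uvec4 cluster = subgroupPartitionNV(key);

    // Reduce within each partition independently.
    float clusterSum = subgroupPartitionedAddNV(value, cluster);

    sums[i] = clusterSum; // every member of a cluster gets the cluster total
}
```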

    Barycentric Coordinates and manual attribute interpolation

    (Illustration courtesy of Jean-Colas Prunier.)
    This feature exposes barycentric coordinates as a Fragment Shader input in GLSL (and SPIR-V), and provides the ability for a Fragment Shader to directly fetch raw per-vertex values in order to perform manual barycentric interpolation.
    A three-component vector built-in input, gl_BaryCoordNV, provides perspective-corrected barycentric coordinates (gl_BaryCoordNoPerspNV for the non-perspective-correct version). Per-vertex inputs use the same bracketed array syntax as Tessellation and Geometry Shader inputs, and a pervertexNV qualifier identifies input blocks and variables which read raw per-vertex values from the vertices of the original primitive.

    This feature potentially allows more efficient data passing to the Fragment Shader using compact or compressed data formats for instance. It could also allow interpolation from vertex values fetched directly from memory, user defined interpolations, or various reconstructions and computations using raw attributes accessed from the three vertices.
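    As an illustration, manual interpolation in a fragment shader could look like this; the packed per-vertex color is a made-up example of a compact format:

```glsl
#version 450
#extension GL_NV_fragment_shader_barycentric : require

// Raw, non-interpolated values from the three vertices of the primitive.
pervertexNV in PerVertex { uint packedColor; } verts[];

layout(location = 0) out vec4 fragColor;

void main()
{
    vec3 c0 = unpackUnorm4x8(verts[0].packedColor).rgb;
    vec3 c1 = unpackUnorm4x8(verts[1].packedColor).rgb;
    vec3 c2 = unpackUnorm4x8(verts[2].packedColor).rgb;

    // Manual perspective-corrected barycentric interpolation.
    vec3 color = gl_BaryCoordNV.x * c0
               + gl_BaryCoordNV.y * c1
               + gl_BaryCoordNV.z * c2;

    fragColor = vec4(color, 1.0);
}
```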

    Ptex Hardware Acceleration

    A corner-sampled image has texels centered on integer coordinates instead of halfway between them, which allows edge sampling coordinates to filter to the exact texels on the edge of the texture. This facilitates implementing Ptex (Per-face Texturing [Burley and Lacewell 2008], cf. https://developer.nvidia.com/sites/default/files/akamai/gamedev/docs/Borderless%20Ptex.pdf) in real-time applications by providing proper filtering and interpolation. Ptex uses separate textures for each face of a subdivision surface or polygon mesh, with sample locations placed at pixel corners, maintaining continuity between adjacent patches by duplicating values along shared edges.

    Representative Fragment Test

    This extension has been designed to allow optimizing occlusion-query techniques which rely on per-fragment recording of visible primitives. It allows the hardware to stop generating fragments and emitting fragment shader invocations for a given primitive once a single fragment has passed the early depth and stencil tests. This reduced subset of fragment shader invocations can then be used to record visible primitives in a more performant way. This is only a performance optimization: no guarantee is given on the number of discarded fragments, and consequently on the number of fragment shader invocations that will actually be executed.
    A typical use case for this feature can be found in this occlusion culling sample code: https://github.com/nvpro-samples/gl_occlusion_culling

    Multi-View Rendering

    Pascal's Simultaneous Multi-Projection (SMP) and stereo view features allowed broadcasting the same geometric data for rasterization to multiple views (up to 2) and viewports (up to 16), with very limited possibilities for per-view (and per-viewport) variation (a horizontal position offset per view, plus an xyz swizzle and viewport attributes per viewport). Turing generalizes and improves on this feature by allowing fully independent per-view attribute values (including vertex positions) to be specified, and exposes up to 4 views. No new extension is provided; the feature is exposed transparently as an optimization to the existing standardized Vulkan VK_KHR_multiview and OpenGL GL_OVR_multiview extensions.
    More info on Multi-View Rendering in this blog post: https://devblogs.nvidia.com/turing-multi-view-rendering-vrworks/

    NVIDIA Maxwell OpenGL extensions

    NVIDIA just launched the second-generation Maxwell architecture with the GM204 GPU, which is, I believe, an incredible chip. The Maxwell 2 architecture is both highly energy efficient (~2x the perf/watt of Kepler in games) and provides a lot of very exciting new graphics features (some of them exposed in Direct3D). These features are exposed in the form of new OpenGL extensions in the R344 driver released today, and the specification for all NVIDIA-supported GL extensions can be found here. NVIDIA also released new SDK samples using these extensions.

    Quick description of the new extensions


    This feature adds a lot of flexibility to multi-sampled rasterization. It decouples the rasterization sampling frequency (which can be set explicitly) from the actual framebuffer storage, enabling rasterization to operate at a higher sampling frequency than that of the target color buffers. Depth and stencil testing at this frequency are supported, provided the corresponding depth and stencil buffers are sampled accordingly (the frequency must be a multiple of the number of samples in the color buffers).
    There are still some constraints: all color buffers must have the same number of samples, and the raster sample count must match the depth and stencil buffer sample count if depth or stencil testing is enabled, and must be greater than or equal to the color buffer sample count.

    A new “coverage reduction stage” is introduced in the per-fragment operations (after the fragment shader in early-z mode, after the depth-test in late-z), which converts a set of covered raster/depth/stencil samples to a set of covered color samples. There is an implementation-dependent association of raster samples to color samples. The reduced "color coverage" is computed such that the coverage bit for each color sample is 1 if any of the associated bits in the fragment's coverage is set, and 0 otherwise. This feature can be used in conjunction with the coverage to color feature (cf. below), in order to get the FS output coverage mask automatically transformed into a color by ROP. According to AnandTech, when rasterizing with explicit multisampling and no render-target, Maxwell allows evaluating primitive coverage at 16x MSAA.

    Note that EXT_raster_multisample is equivalent to "Target-Independent Rasterization" in Direct3D 11.1, which allows using multiple raster samples with a single color sample, as long as depth and stencil tests are disabled, and it is actually a subset of NV_framebuffer_mixed_samples which is more general and exposes more flexibility.

    This allows using ROP to automatically convert the post depth-/stencil-/alpha- test coverage mask into a color and write it into a color render target. This conversion is performed before the new coverage reduction stage (cf. NV_framebuffer_mixed_samples) and can be useful in order to save coverage in the context of deferred shading.

    When operating in early-depth mode (layout(early_fragment_tests) in;, see here for more information), this extension allows the fragment shader to get the post depth-test coverage mask of the current fragment as input (gl_SampleMaskIn[], in which only samples passing the depth test are set), unlike the standard GL 4.5 behavior which provides the pre-depth-test coverage (actual triangle coverage).

    With standard OpenGL, the Fragment Shader output coverage mask (gl_SampleMask[]) is ANDed with the actual primitive input coverage mask before being used in subsequent pipeline stages. This extension disables this AND operation, which allows the fragment shader to entirely override the primitive coverage, and enables setting coverage bits that are not present in the input mask. This is actually very nice, because it allows using the output coverage as a way to dynamically route color output values into arbitrary sample locations inside a multisampled render target.
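    In GLSL this is exposed as a layout qualifier on the gl_SampleMask[] redeclaration. A sketch that routes the output color to one arbitrary sample slot (the routing logic is hypothetical):

```glsl
#version 450
#extension GL_NV_sample_mask_override_coverage : require

// With override_coverage, the written mask fully replaces the primitive's
// input coverage instead of being ANDed with it.
layout(override_coverage) out int gl_SampleMask[];

layout(location = 0) in  vec4 color;
layout(location = 0) out vec4 fragColor;

void main()
{
    // Hypothetical routing: store this fragment's color into sample slot 3
    // of the multisampled target, regardless of actual primitive coverage.
    gl_SampleMask[0] = 1 << 3;
    fragColor = color;
}
```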

    This allows applications to explicitly set the locations of sub-pixel samples for multisample rasterization, providing fully programmable sampling patterns. Sampling patterns can be defined within a grid of adjacent pixels, whose size depends on the number of samples. Queries indicate that the sub-pixel positions are snapped to a 16x16 sub-pixel grid.


    This is a really great feature. It allows rasterization to generate fragments for any pixel touched by a triangle, even if no sample location within the pixel is covered. A new control is also provided to modify the window coordinate snapping precision, allowing the application to match conservative rasterization triangle snapping with the snapping that would have occurred at a higher resolution. Polygons with zero area generate no fragments. Any location within a pixel may be used for interpolating attributes, potentially causing attribute extrapolation outside the triangle. This can be useful for binning purposes, for instance (using one pixel per tile).

    This extension exposes a hardware-accelerated critical section for the fragment shader, allowing hazard-free read-modify-write operations on a per-pixel basis. It also allows enforcing primitive ordering for threads entering the critical section. It provides new GLSL calls beginInvocationInterlockNV() and endInvocationInterlockNV() defining a critical section which is guaranteed to be executed for only one fragment at a time. Interlock can be done on a per-pixel or a per-sample basis if multi-sampled rasterization is used. This feature is useful for algorithms that need to access per-pixel data structures via shader load and store operations while avoiding race conditions. Obvious applications are OIT and programmable blending, for instance.
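    A sketch of such a critical section used for programmable blending; the image binding and the blend math are placeholders:

```glsl
#version 450
#extension GL_NV_fragment_shader_interlock : require

// Ordered per-pixel interlock: critical sections of fragments covering the
// same pixel execute one at a time, in primitive order.
layout(pixel_interlock_ordered) in;

layout(binding = 0, rgba8) uniform image2D colorBuffer;
layout(location = 0) in vec4 srcColor;

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);

    beginInvocationInterlockNV();
    // Hazard-free read-modify-write: hypothetical "over" blend.
    vec4 dst = imageLoad(colorBuffer, p);
    imageStore(colorBuffer, p, srcColor + (1.0 - srcColor.a) * dst);
    endInvocationInterlockNV();
}
```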

    This allows rasterizing the axis-aligned screen-space bounding box of submitted triangles, disregarding the actual triangle edges. It can be useful for drawing a full-screen quad without generating an internal edge, for instance, or for drawing user interfaces more efficiently.

    Geometry processing

    This extension makes geometry shaders more efficient in the pass-through case, i.e. when there is a one-to-one mapping between input and output primitives. In this case, per-vertex attributes are simply copied from the input primitive to the output primitive, and the geometry shader is only used to set per-primitive attributes (like gl_Layer, gl_ViewportMask[], ...), which can be computed from the input vertex attributes.
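    A sketch of a pass-through geometry shader that only computes gl_Layer, following the pattern from the extension spec; the layer selection logic is made up:

```glsl
#version 450
#extension GL_NV_geometry_shader_passthrough : require

layout(triangles) in;

// "passthrough" inputs are copied verbatim to the output primitive,
// so no per-vertex EmitVertex()/EndPrimitive() loop is needed.
layout(passthrough) in gl_PerVertex { vec4 gl_Position; } gl_in[];
layout(passthrough) in Inputs { vec2 texcoord; } v_in[];

void main()
{
    // Only per-primitive outputs are written: hypothetical layer selection
    // based on the first vertex's position.
    gl_Layer = (gl_in[0].gl_Position.z > 0.0) ? 1 : 0;
}
```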

    Viewport multicast allows automatically broadcasting the same primitive to multiple viewports (and/or multiple layers when using layered render targets) simultaneously, to be rasterized multiple times. It is exposed through a new gl_ViewportMask[] GLSL output attribute, available in both the vertex shader and the geometry shader. This can be especially powerful when combined with the new passthrough geometry shader. A sample using it to speed up cascaded shadow maps is available here.
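    Used from a vertex shader, the broadcast could look like this; the mask value is an arbitrary example:

```glsl
#version 450
#extension GL_NV_viewport_array2 : require

layout(location = 0) in vec4 inPosition;

// Bit i set => the primitive is rasterized into viewport i.
out int gl_ViewportMask[1];

void main()
{
    gl_Position = inPosition;
    gl_ViewportMask[0] = 0xF; // broadcast to viewports 0..3 in a single pass
}
```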


    This extension improves on ARB_sparse_texture, which separates the allocation of virtual address space from the physical memory of textures, and provides the ability to sparsely allocate the physical backing store of 2D/3D/2DArray textures on a per-tile basis. This new extension adds the ability to retrieve texture access residency information from GLSL, to specify a minimum allocated LOD for texture fetches, and to return a constant zero value for lookups into unallocated pages. It also adds support for multi-sampled textures.

    This exposes a new sampler parameter which allows performing a min or max reduction over the values sampled inside a texture filtering footprint, instead of the regular linear interpolation. It is supported for all kinds of textures, as well as with anisotropic filtering.


    This extension provides a set of new atomic operations operating on 2- and 4-component vectors of 16-bit floating-point values, for images, bindless pointers to global memory, and storage buffers.

    Unreal Engine 4 : real-time GI using voxel cone-tracing !

    Epic Games just released two impressive videos presenting their new UE4 game engine, featuring real-time global illumination that supports diffuse and specular bounces in dynamic scenes.

    According to this interview with Tim Sweeney, their technique is called SVOGI (Sparse Voxel Octree Global Illumination) and... it is based on the voxel cone-tracing technique we published last year at Pacific Graphics! :-D

    It's really awesome to see research results used so quickly in a game engine! Beyond the paper, you can find more details about our voxel cone-tracing approach in my GTC 2012 talk, my Siggraph talk, as well as in my Ph.D thesis on GigaVoxels.

    GTC 2012 Talk: "Octree-Based Sparse Voxelization for Real-Time Global Illumination"

    This week I gave a talk at the NVIDIA GPU Technology Conference about a fast sparse voxelization technique that I worked on at NVIDIA in the context of my real-time global illumination approach (using voxel cone-tracing inside a sparse voxel octree).

    Video (if you want to enjoy my french accent): http://www.gputechconf.com/

    In this talk I first give an overview of the real-time GI approach, before quickly detailing the new GPU voxelization and octree construction algorithm.

    This sparse voxelization technique will be published in the OpenGL Insights book that will be out for Siggraph.

    [UPDATE 07/12] The book chapter in OpenGL Insights has been published online here.

    Ph.D thesis: GigaVoxels

    I defended my Ph.D thesis on GigaVoxels last July, and the document is now online.

    You can download it there:
    GigaVoxels: A Voxel-Based Rendering Pipeline For Efficient Exploration Of Large And Detailed Scenes

    You can also check my other publications on my Ph.D webpage.

    Interview on 3DVF

    I gave an interview to 3DVF, a great French community website dedicated to image synthesis, and numerical creation in general. The interview is mainly centered around our voxel-based GI approach, as well as my former INRIA research team and computer graphics research in France.

    The interview can be read here (in French).

    Ph.D defended and postdoc at NVIDIA Research

    I did not take the time to report on this before, but I defended my Ph.D thesis on GigaVoxels at INRIA this summer (the thesis will soon be available there), and I started a one-year postdoc at NVIDIA Research in the SF Bay Area. There, I continue working on voxel representations, of course :-D

    Even though I am now an NVIDIA employee, this continues to be a personal blog: everything I publish here is entirely my own, and my opinions do not necessarily reflect those of my employer!

    Real-Time Shadows Book

    Just a quick post to recommend the great book "Real-Time Shadows", which details, explores, and clearly explains a large number of relevant techniques for real-time shadow rendering, which, as we all know, must be one of the most prolific fields in computer graphics!
    In addition, this book is written by four leading experts in the field: Elmar Eisemann, Michael Schwarz, Ulf Assarsson and Michael Wimmer.
    For me, it is now THE reference on shadow rendering!

    More info on the book website, and a good description on the publisher's website. Of course, it can also be bought on Amazon.

    Siggraph 2011 Slides

    Here is a quick list of interesting Siggraph 2011 courses and talks available online.

    Courses/Talks :
    Advances in Real-Time Rendering in Games
    Beyond Programmable Shading
    State-of-the-Art:  Interactive Global Illumination
    Destruction and Dynamics for Film and Game Production
    Filtering Approaches for Real-Time Anti-Aliasing
    Production Volume Rendering
    Compiler Technology for Rendering
    Liquid Simulation with mesh-based Surface Tracking

    NVIDIA Slides , Videos

    OpenGL/OpenCL :
    OpenGL BOF Slides
    Introduction to Modern OpenGL Programming
    OpenCL BOF Slides

    HPG 2011

    Also a great list with more posters and talks is available on Stephen Hill's Blog
    There is also my talk on Interactive Indirect Illumination Using Voxel Cone Tracing :-)

    OpenGL 4.2 specification released

    Specifications can be found in the GL registry: http://www.opengl.org/registry/
    NVIDIA drivers already supporting it as well as new ARB extensions can be found there: http://developer.nvidia.com/opengl-driver

    A full and very good review can be found on G-Truc Creation: http://www.g-truc.net/post-0414.html
    Among the interesting things: shader_atomic_counters, conservative depth, instanced transform feedback, the integration of shader_image_load_store, and more!

    And congrats to Christophe for his first credit in the spec :-D

    "Interactive Indirect Illumination Using Voxel Cone Tracing" paper accepted at Pacific Graphics 2011

    I am glad to announce that our paper "Interactive Indirect Illumination Using Voxel Cone Tracing" (cf. my previous post http://blog.icare3d.org/2011/06/interactive-indirect-illumination-and.html) has been accepted at Pacific Graphics 2011 !

    You can find the authors version of the paper on my research page :
    On my INRIA webpage:

    Also, don't forget to attend my talk if you are going to Siggraph 2011 !
    Tuesday, 9 August 9:00 am - 10:30 am | West Building, Rooms 109/110

    Free 3D meshes links

    I started to compile links to websites where free 3D models can be found. If you know other good websites, feel free to post them in the comments :-)

    Static models and scenes:

    Animated models and scenes:

      Interactive Indirect Illumination and Ambient Occlusion Using Voxel Cone Tracing

      I am happy to present a new approach to compute indirect illumination and ambient occlusion in real time, using a pre-filtered geometry representation stored inside a sparse voxel octree, together with a new approximate voxel-based cone-tracing. The sparse voxel octree is used as a proxy for occlusion information and indirect illumination; it supports dynamic objects and environments thanks to a new, very fast voxelization algorithm that updates the octree structure and pre-filters geometry and lighting information.

      This work has been done in collaboration with Miguel Sainz and Simon Green from  NVIDIA, as well as Fabrice Neyret from CNRS and Elmar Eisemann from ParisTech.

      The paper has been published at Pacific Graphics. You can find the author's version on my research page:

      I also presented this work as a talk at Siggraph 2011 in Vancouver as well as a poster at I3D 2011.

      [Update 30/06/2011]: Benoit Rogez wrote a very good article about our approach on 3DVF (in French) : here
      [Update 10/06/2012]: More details on the algorithm and the fast voxelization and octree construction can be found in my GTC 2012 presentation.

      Indirect lighting (global illumination):
      High quality video: http://artis.imag.fr/Membres/Cyril.Crassin/GIVoxels/Siggraph11_GI1.mov

      Ambient occlusion:
      High quality video: http://artis.imag.fr/Membres/Cyril.Crassin/GIVoxels/Siggraph11_AO1.mov

      The vicious circle of generalization @AltDevBlogADay

      Full story here :-D

      This can be described as vicious circle of generalization:
      1. We want the code to be as generalized as possible.
      2. We design everything future-proof and extendible.
      3. When a feature request arrives, we’re doomed we need to change a lot of code.
      4. Why?
      5. Because everything was designed as generalized as possible.
      6. goto 1;

      GDC 2011 Technical Presentations

      GDC 2011 is over now and presentations start to appear on-line :-)

      DICE: http://publications.dice.se/
      NVIDIA: http://www.nvidia.com/object/gdc2011.html
      AMD: http://developer.amd.com/documentation/presentations/Pages/default.aspx#GDC
      INTEL: http://software.intel.com/en-us/articles/intelgdc2011/
      Khronos on OpenGL: http://www.khronos.org/library/detail/2011-gdc-opengl

      More links can be found on this blog: http://msinilo.pl/blog/?p=724

      Season of next-gen game engines

      This is the season of the new game engines : Dice Frostbite 2, Crytek CryEngine 3 and Epic Unreal Engine 3 !

      CUDA 4.0 RC Released to Registered Developers

      Here it is, CUDA 4.0 RC just got released to NVIDIA Registered developers.

      Interesting stuff from the CUDA manual:

      • Layered Textures Support (GL_TEXTURE_1D/2D_ARRAY): new tex.a1d/.a2d modifiers in PTX. But unfortunately the surface instructions do not support them yet, grrr.
        Layered textures are created using cudaMalloc3DArray() with the cudaArrayLayered flag. New cudaTextureType1DLayered/cudaTextureType2DLayered texture sampler types and tex1DLayered()/tex2DLayered() access intrinsics.
      • New .address_size PTX specifier: allows specifying the address size (32b/64b) used throughout a PTX module.
      • Inline PTX assembly: This feature was already present since CUDA 2.x but was not officially supported. It's now fully supported and documented :-D
      • Driver API, new thread-safe stateless launch API function cuLaunchKernel(): cuLaunchKernel(kernelObj,   blocksPerGrid, 1, 1,   threadsPerBlock, 1, 1,   0, 0, args, 0);
      • FERMI ISA documented and supported by cuobjdump.
      • Enhanced C++: support for operators new and delete, and for virtual functions.

      CUDA 4.0 announced @GDC 2011

      Today NVIDIA announced CUDA 4.0 at the GDC. It will be available to registered developers on March 4th :-D
      Among interesting novelties, there is the support for layered textures (GL_TEXTURE_2D_ARRAY) that I hope will also be supported for surface access !
      There is also support for direct peer-to-peer communication between GPUs, and for mapping the memory of multiple GPUs (and potentially other third-party devices like network/InfiniBand adapters) into the same address space to provide direct memory access (Unified Virtual Addressing, UVA). Virtual functions should also now be supported, along with new and delete for dynamic memory allocation from kernels.
      Looking forward to test all of this !

      More info : Anandtech, NVIDIA pressroom, Dr Dobbs GPGPU.org

      Update: CUDA 4.0 RC released to registered developers
      Slides are available there: http://bit.ly/cuda4features
      Among the interesting novelties I had not seen before, it seems inline PTX will be officially supported with this release! Also, the disassembler (cuobjdump), which was previously limited to the Tesla ISA, now supports Fermi ISA disassembly. Take a look at the manual for the list of supported instructions.

      Copyright © Icare3D Blog