Interview on 3DVF
Ph.D. defended and postdoc at NVIDIA Research
Real-Time Shadows Book
Siggraph 2011 Slides
Here is a quick list of interesting Siggraph 2011 courses and talks available online.
Courses/Talks:
Advances in Real-Time Rendering in Games
Beyond Programmable Shading
State-of-the-Art: Interactive Global Illumination
Destruction and Dynamics for Film and Game Production
Filtering Approaches for Real-Time Anti-Aliasing
Production Volume Rendering
Compiler Technology for Rendering
Liquid Simulation with Mesh-Based Surface Tracking
Companies:
CRYTEK
NVIDIA: Slides, Videos
INTEL
OpenGL/OpenCL:
OpenGL BOF Slides
Introduction to Modern OpenGL Programming
OpenCL BOF Slides
Others:
HPG 2011
Also a great list with more posters and talks is available on Stephen Hill's Blog
There is also my talk on Interactive Indirect Illumination Using Voxel Cone Tracing :-)
OpenGL 4.2 specification released
Among interesting things: shader_atomic_counters, conservative depth, instanced transform feedback, the integration of shader_image_load_store, and more!
And congrats to Christophe for his first credit in the spec :-D
"Interactive Indirect Illumination Using Voxel Cone Tracing" paper (Pacific Graphics 2011)
http://www.icare3d.org/research-cat/publications/interactive-indirect-illumination-using-voxel-cone-tracing.html
On the NVIDIA Research website:
http://research.nvidia.com/publication/interactive-indirect-illumination-using-voxel-cone-tracing
Also, don't forget to attend my talk if you are going to Siggraph 2011!
Free 3D meshes links
I started to compile links to websites where free 3D models can be found. If you know other good websites, feel free to post them in the comments :-)
Static models and scenes:
- Great collection of models for scientific publications on Morgan McGuire's webpage: http://graphics.cs.williams.edu/data/meshes.xml
- 3D Render challenge: http://www.3drender.com/challenges/
- Crytek: http://www.crytek.com/cryengine/cryengine3/downloads
- Keenan Crane : http://www.cs.caltech.edu/~keenan/models.html
- Sibenik model: http://hdri.cgtechniques.com/~sibenik2/download/
- AIM@Shape : http://shapes.aimatshape.net/
- Characters Creative Common: http://artist-3d.com/
- Characters: http://www.3dvalley.com/3d-models/characters
- Blender files: http://www.blendswap.com/3D-models/category/featured/
- Archive 3D: http://archive3d.net/
Animated models and scenes:
- Ingo Wald: http://www.sci.utah.edu/~wald/animrep/
- MIT CSAIL 1: http://people.csail.mit.edu/drdaniel/mesh_animation/index.html
- MIT CSAIL 2: http://people.csail.mit.edu/drdaniel/dynamic_shape/index.html
- MIT Animals and Face: http://people.csail.mit.edu/sumner/research/deftransfer/data.html
- Face data: http://grail.cs.washington.edu/software-data/stfaces/index.html
- Pants: http://www.ryanmwhite.com/research/cloth_cap.html
Interactive Indirect Illumination and Ambient Occlusion Using Voxel Cone Tracing
The paper has been published at Pacific Graphics. You can find the author's version on my research page:
http://www.icare3d.org/research-cat/publications/interactive-indirect-illumination-using-voxel-cone-tracing.html
[Update 30/06/2011]: Benoit Rogez wrote a very good article about our approach on 3DVF (in French): here
[Update 10/06/2012]: More details on the algorithm and the fast voxelization and octree construction can be found in my GTC 2012 presentation.
Indirect lighting (global illumination):
High quality video: http://artis.imag.fr/Membres/Cyril.Crassin/GIVoxels/Siggraph11_GI1.mov





Ambient occlusion:
High quality video: http://artis.imag.fr/Membres/Cyril.Crassin/GIVoxels/Siggraph11_AO1.mov



The vicious circle of generalization @AltDevBlogADay
Full story here :-D
This can be described as the vicious circle of generalization:
1. We want the code to be as generalized as possible.
2. We design everything to be future-proof and extensible.
3. When a feature request arrives, we're doomed: we need to change a lot of code.
4. Why? Because everything was designed to be as generalized as possible.
5. goto 1;
GDC 2011 Technical Presentations
GDC 2011 is over now and presentations are starting to appear online :-)
DICE: http://publications.dice.se/
NVIDIA: http://www.nvidia.com/object/gdc2011.html
AMD: http://developer.amd.com/documentation/presentations/Pages/default.aspx#GDC
INTEL: http://software.intel.com/en-us/articles/intelgdc2011/
Khronos on OpenGL: http://www.khronos.org/library/detail/2011-gdc-opengl
More links can be found on this blog: http://msinilo.pl/blog/?p=724
Season of next-gen game engines
This is the season of new game engines: DICE's Frostbite 2, Crytek's CryEngine 3, and Epic's Unreal Engine 3!
CUDA 4.0 RC Released to Registered Developers
Here it is, CUDA 4.0 RC just got released to NVIDIA Registered developers.
Interesting stuff from the CUDA manual:
- Layered Textures Support (GL_TEXTURE_1D/2D_ARRAY): New tex.a1d/.a2d modifiers in PTX. But unfortunately the surface instructions do not support them yet, grrr.
Layered textures are created using cudaMalloc3DArray() with the cudaArrayLayered flag. There are new cudaTextureType1DLayered/cudaTextureType2DLayered texture sampler types and tex1DLayered()/tex2DLayered() access intrinsics (see the sketch after this list).
- New .address_size PTX specifier: Allows specifying the address size (32-bit/64-bit) used throughout a PTX module.
- Inline PTX assembly: This feature was already present since CUDA 2.x but was not officially supported. It's now fully supported and documented :-D
- Driver API, new thread-safe stateless launch API function cuLaunchKernel(): cuLaunchKernel(kernelObj, blocksPerGrid, 1, 1, threadsPerBlock, 1, 1, 0, 0, args, 0);
- Fermi ISA documented and supported by cuobjdump.
- Enhanced C++: Support for operators new and delete, virtual functions.
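As a quick illustration of the layered-texture API mentioned in the first item above, here is a minimal CUDA 4.0-style sketch (sizes and names are illustrative):

    texture<float, cudaTextureType2DLayered> layeredTex;

    __global__ void readLayer(float* out, int w, int layer) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        // Fetch from a given layer of the 2D layered texture.
        out[y * w + x] = tex2DLayered(layeredTex, x + 0.5f, y + 0.5f, layer);
    }

    void allocLayered(int w, int h, int layers) {
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
        cudaArray* arr;
        // The cudaArrayLayered flag turns the depth dimension into layers.
        cudaMalloc3DArray(&arr, &desc, make_cudaExtent(w, h, layers),
                          cudaArrayLayered);
        cudaBindTextureToArray(layeredTex, arr, desc);
    }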
CUDA 4.0 announced @GDC 2011
There is also support for direct peer-to-peer communication between GPUs, and for mapping the memory of multiple GPUs (and potentially other third-party devices like network/InfiniBand cards) into the same address space in order to provide direct memory access (Unified Virtual Addressing, UVA). Virtual functions should also now be supported, along with the new and delete operators for dynamic memory allocation from kernels.
Update: CUDA 4.0 RC released to registered developers
Slides are available there: http://bit.ly/cuda4features
Among the interesting novelties I did not see before, it seems inline PTX will be officially supported with this release! Also, the disassembler (cuobjdump), which was previously limited to the Tesla ISA, now supports Fermi ISA disassembly. Take a look at the manual for the list of supported instructions.
Writing a PhD thesis using FreeMind and LaTeX: My FreeMind LaTeX import plug-in

#AltDevBlogADay
Today, I would like to invite you to follow an awesome new collaborative game development blog called #AltDevBlogADay. The blog was founded in January by Mike Acton, engine director at Insomniac Games, and groups together many very good game developers, who publish one article each day on the website.
Here is a short list of the articles, among those already published, that I particularly liked and advise you to take a look at:
- Aliasing, the silent killer
- Forward vs. Deferred Rendering: why/when forward rendering still matters
- Think low level, write high level
- Preparing for Parallelism
- Intuition for Gamma Correct Rendering
- Link time sorting
- Research tastes better when served with source
- Managing Decoupling
- Radix Sort for Humans
- Non Virtual Interfaces
- Vectiquette
- The Virtual and No-Virtual
- Refactoring for Task Concurrency
- Performance metaprogramming
- The Rendering Equation - A Pictorial Introduction
- Managing Coupling Part 2 — Polling, Callbacks and Events
- Alternatives to malloc and new
- Data Structures: One size does not fit all
January pop stack post !
- NVIDIA Announced Project Denver: A fully integrated CPU+GPU chip dedicated to high-performance computing and based on ARM cores. It is designed to power future products ranging from personal computers to servers and supercomputers. It looks like NVIDIA's answer to AMD Fusion, and is also clearly a "we don't believe in the future of x86 architectures" statement to the industry. I really believe this is the right approach to bring high-performance CPUs to all market segments. Thanks to the rise of mobile and tablet-like devices, we may finally see the end of the hegemony of the x86 architecture. I really hope that it will not be the last dinosaur from NVIDIA ;-)
- gDebugger is now free !
- The “Principles of Digital Image Synthesis” book is now free for download, and a nicely formatted PDF can be found here.
- Data-Oriented Design - Links and Thoughts: The concept of Data-Oriented Design (DOD), as opposed to Object-Oriented Programming (OOP), has been emerging and gaining popularity in the game developer community lately. The idea is that instead of building performance-sensitive parts of applications around classes that represent individual objects, with inefficient (non-cache-coherent) memory access patterns, it is more efficient to build them by thinking in terms of data flows and memory access patterns. That's a concept that may not be unfamiliar to GPU developers :-D
- Overtime vs Productivity: Two very interesting blog posts by Allan McKay that discuss how people work, why we often get burned out, and how to avoid it.
- A lot of very interesting technical documentation about NVIDIA GPUs, coming from reverse engineering of various hardware, can be found in the pscnv Git repository. Pscnv is a fork of the nouveau project, an open source driver for NVIDIA GPUs.
"How the GPU works" @c0de517e
I rediscovered a very good in-depth explanation of how the GPU works, published in 2008 on the c0de517e blog:
Part 1 Part 2 Part 3
CUDA "Better Performance at Lower Occupancy" @GTC2010
A friend pointed me to this very interesting talk from NVIDIA GTC:
Better Performance at Lower Occupancy
It debunks two common fallacies that CUDA developers usually believe in:
- Multithreading is the only way to hide latency on GPU
- Shared memory is as fast as registers
All the GTC 2010 presentations can be found there (with slides and videos!):
http://www.nvidia.com/object/gtc2010-presentation-archive.html
Old Real-Time GPU Raytracer
PS: It is funny to see what was possible at the time, but it was developed quickly and the shader code itself is not a reference!
Fluid Simulation for Video Games @INTEL

CUDA 3.2 Final released
Download it on NVIDIA Developer website !
http://developer.nvidia.com/object/cuda_3_2_downloads.html
NVIDIA Fermi GPU and Architecture Analysis @Beyond3D

In this article, I also discovered "Pomegranate", a parallel hardware architecture for polygon rendering developed at Stanford, which seems to be very close to the way Fermi handles parallel work distribution for the different steps of the graphics pipeline. Pomegranate [Eldridge et al., 2000]
Discussions are on Beyond3D Forum.
First reviews of the NVIDIA GF110: GTX580

Texture and buffer access performance on Evergreen architecture @rastergrid.com
A very interesting article about texture and buffer access performance in OpenGL on the AMD Evergreen architecture:
http://rastergrid.com/blog/2010/11/texture-and-buffer-access-performance/
Various stuff from October
I did not have much time to update this blog lately, so here is some interesting stuff I did not post during October:
- GPU-Assisted Malware : http://www.ics.forth.gr/dcs/Activities/papers/gpumalware.malware10.pdf
- Thrust 1.3 released : http://gpgpu.org/2010/10/07/thrust-v1-3-release
- OpenGL 4.1 drivers status : g-truc creation
- "Can CPUs Match GPUs on Performance with Productivity ?" : IBM Research
- GPU Technology Conference Session Video Archive : NVIDIA
- EASTL: An implementation of the C++ STL made by EA and optimized for video game usage
OpenGL SuperBible Fifth Edition
Visual Transistor-level Simulation of the 6502 CPU
NVIDIA nextgen Kepler and Maxwell architectures codenames unveiled

So we now publicly know that Johannes Kepler and James Clerk Maxwell are the next two scientists to succeed Enrico Fermi!
OpenGL 4.2+ speculations @g-truc
An awesome post from Christophe Riccio about his speculations for the next OpenGL releases:
http://www.g-truc.net/post-tech-lastest-full.html#post0330
CUDA 3.2 RC
Available to registered developers.
Here is the interesting new stuff I found:
- Support for malloc() and free() in kernels: dynamic global memory allocation! This is implemented with a new syscall linking mechanism that seems to allow kernels to be linked to precompiled system calls. Info on the linking mechanism (.calltargets, .callprototype) can be found in section 10.3 of the PTX ISA manual. I hope this mechanism will get exposed for user functions in the API! (See the sketch after this list.)
- 64-bit addressing support in the CUDA driver API: Allows manipulating more than 4GB of device memory.
- New System Management Interface (nvidia-smi) for reporting various hardware counter information
- New stream synchronization function cudaStreamSynchronize(): allows GPU-side inter-stream synchronization.
- A set of new calls is available to allow the creation of CUDA devices with interoperability with Direct3D devices that use SLI in AFR (Alternate Frame Rendering)
- New flag to driver API texture reference (CU_TRSF_SRGB), which enables sRGB->linear conversion on a read.
- The reference manual adds architecture information on GF10x (GF104, GF106, GF108) class hardware (compute capability 2.1)
- Added a tld4 (fetch4) instruction for loading a single component (r, g, b, or a) from the four texels comprising the bilinear interpolation footprint of a given texture location.
- Added the ability to specify the state space and alignment of the memory pointed to by kernel pointer parameters.
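As a quick illustration of the device-side allocation mentioned in the first item above, here is a minimal CUDA sketch (kernel name and sizes are illustrative; requires sm_20+ hardware):

    __global__ void scratchKernel(int n) {
        // Dynamic global memory allocation from device code (CUDA 3.2+).
        int* buf = (int*)malloc(n * sizeof(int));
        if (buf == NULL) return;  // device heap exhausted
        for (int i = 0; i < n; ++i)
            buf[i] = i * (int)threadIdx.x;
        free(buf);
    }

    // Host side: the device heap size must be set before the first launch.
    cudaThreadSetLimit(cudaLimitMallocHeapSize, 32 * 1024 * 1024);
    scratchKernel<<<1, 64>>>(256);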
New CUDA Libraries
- CUSPARSE, supporting sparse matrix computations.
- CURAND, supporting random number generation for both host and device code with Sobol' quasi-random and XORWOW pseudo-random routines.
Official CUDA disassembler for sm_1.x real ISA
Yesterday NVIDIA released an official disassembler for the sm_1.x (pre-Fermi) real hardware ISA. It's like an official version of DECUDA :-) (which Wladimir stopped developing)
It takes either an ELF CUDA binary, a cubin, or even an exe file, and provides the low-level assembly code of the CUDA kernels.
It is only available to registered developers for now, but you can get a little more information on the CUDA forum.
That's something a lot of developers have been asking for for a while. It allows seeing the impact of optimizations on the real microcode, which is particularly important for register usage for instance (since register allocation is done after the PTX level).
It's nice that NVIDIA finally ended up unveiling its real hardware ISA instructions. AMD is still a little bit ahead on this, since the ISA instructions and microcode are documented even for the Evergreen architecture (RV870): http://developer.amd.com/gpu/ATIStreamSDK/assets/AMD_Evergreen-Family_ISA_Instructions_and_Microcode.pdf
What exactly is a doctorate @Gizmodo
What EXACTLY is a PhD? The answer is there:
http://gizmodo.com/5613794/what-is-exactly-a-doctorate
"Keep pushing" !
OpenGL 4.1 review @g-truc + Samples pack
OpenGL 4.1 Review: http://www.g-truc.net/post-tech-lastest-full.html
Christophe also published an updated version of his sample pack, with a nice table showing how the new stuff works on both NVIDIA and ATI (Beta drivers): http://www.g-truc.net/post-tech-lastest-full.html#post0321
UPDATE: Another good article about OpenGL 4.1: http://rastergrid.com/blog/2010/08/an-introduction-to-opengl-4-1/
Hacking D3D virtual calls + v-table hacking made useful @Humus
The Direct3D API is a fully object-oriented C++ API that relies on runtime polymorphism (virtual function calls) to be extendable and to easily provide different implementations. So all API calls are virtual calls, instead of plain C calls like in OpenGL.
Every slightly experienced C++ developer knows that virtual function calls introduce overhead and should be avoided inside inner loops. Humus shows how these virtual calls can be replaced by standard calls by hacking the API objects' v-table in order to keep a plain C pointer to these virtual methods!
http://www.humus.name/index.php?page=Comments&ID=321
I love this kind of hack! But as Humus explains, D3D (like OpenGL since OpenGL 3.0) does not rely on immediate mode anymore, which means that API calls usually consume a fairly large number of cycles compared to the overhead of a virtual call.
That means that in practice you won't get a significant performance gain from this hack, but it is just really cool :-D And the method could still be useful to overcome performance problems in more badly designed APIs! (A minimal illustration of the trick follows below.)
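To make the idea concrete, here is a small C++ sketch of the general v-table trick. This is not Humus's actual code: the Interface type is made up, and the whole thing relies on implementation-defined ABI details (MSVC-style layout assumed), so it is undefined behavior as far as the standard is concerned:

    // An interface in the style of a D3D device object.
    struct Interface {
        virtual int DrawIndexed(unsigned count) = 0;
    };

    // Matching raw function pointer type. On MSVC x86, member functions use
    // the __thiscall convention; adapt for other compilers/ABIs.
    typedef int (__thiscall* DrawIndexedFn)(Interface*, unsigned);

    int callDirect(Interface* obj, unsigned count) {
        // On common ABIs, the first machine word of the object is the
        // v-table pointer, and slot 0 holds the first declared virtual.
        void** vtable = *reinterpret_cast<void***>(obj);
        DrawIndexedFn fn = reinterpret_cast<DrawIndexedFn>(vtable[0]);
        return fn(obj, count);  // plain indirect call, no virtual dispatch
    }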
UPDATE : D3D v-table hacking... made useful !
Humus just published another trick that shows how hacking the v-table of the D3D context can be used to... replace the default API calls by your own enhanced calls !
Humus shows how this can be useful, for instance, to count the number of times an API function is called. This can be done by overwriting the original object's v-table pointer with the address of your own v-table. More details there:
http://www.humus.name/index.php?page=Comments&ID=322
Various GPU stuff from Siggraph time
NVIDIA Siggraph 2010 presentations available for streaming there.
- Among them a very interesting presentation of OpenGL 3.3/4.0/4.1 by Barthold Lichtenbelt and Mark Kilgard.
Siggraph 2010 Khronos OpenGL BOF and OpenCL BOF slides available.
Reference pages for OpenGL 3.3 and OpenGL 4.1 are online on opengl.org !
- I already said it, but I love the way OpenGL has been evolving since OpenGL 3.0! There really seems to be a will from the vendors to make it a first-class, innovative API again :-)
OptiX 2.0, and Cg Toolkit 3.0 released by NVIDIA
- SM 5 support in Cg at last !
- I tried OptiX (previously NVIRT) recently and I was really impressed, especially by the ease of use of the "high level" optixu interface. That's really an awesome tool.
OpenGL 4.1 Specifications released + NVIDIA drivers
It does not bring a lot of new features, but it's still great to see OpenGL evolving quickly! Direct State Access does not get into the core yet (sorry Christophe ;-), and I am not sure we will get it before OpenGL 5.0...
As usual, NVIDIA is very likely to announce the release of drivers supporting OpenGL 4.1 during the OpenGL BOF :-)
OpenGL.org forum official thread here.
Here are the main new features:
- Viewport Array (ARB_viewport_array). This is, for me, the most interesting new feature. It allows manipulating multiple viewports inside a given render call. Viewports control the behavior of the "viewport transformation" stage (view space -> window coordinates, scissor test). Multiple viewports can be created, and the geometry shader can direct emitted primitives to a selected viewport. A separate viewport rectangle and scissor region can be specified for each viewport (see the sketch after this list).
- Ability to get the binary representation of a program object (ARB_get_program_binary). This is a long-awaited feature present in DX for a while.
- Separate shader objects (ARB_separate_shader_objects). It allows compiling and linking a separate program for each shader stage (VS/TCS/TES/GS/FS). A Program Pipeline Object is introduced to manipulate and bind the separate programs. That's also a useful feature, and it was the way things were done in Cg.
- Improved compatibility with OpenGL ES 2.0 (ARB_ES2_compatibility). Adds a few missing functions and tokens.
- Support for 64-bit vertex attributes in GLSL (ARB_vertex_attrib_64bit).
- Increases required size for textures/renderbuffers.
- ARB_debug_output: Callback mechanisms to receive enhanced errors and warning messages.
- ARB_robustness: Addresses multiple specific goals to improve robustness, for example when running WebGL applications. For instance, it provides additional "safe" APIs that bound the amount of data returned by an API query.
- ARB_shader_stencil_export: Ability to set stencil values in a fragment shader for enhanced rendering flexibility :-)
- ARB_cl_event: Link OpenGL sync objects to OpenCL event objects for enhanced OpenCL interoperability.
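As a quick illustration of the viewport array feature flagged above, here is a minimal C++ sketch (names and layout are illustrative) that sets up four viewports in a 2x2 grid; a geometry shader would then route each primitive by writing gl_ViewportIndex:

    // Assumes a GL 4.1 context; w/h are the window dimensions.
    GLfloat vw = w * 0.5f, vh = h * 0.5f;
    for (GLuint i = 0; i < 4; ++i) {
        GLfloat x = (GLfloat)(i % 2) * vw;
        GLfloat y = (GLfloat)(i / 2) * vh;
        glViewportIndexedf(i, x, y, vw, vh);     // per-viewport rectangle
        glScissorIndexed(i, (GLint)x, (GLint)y,  // matching scissor region
                         (GLsizei)vw, (GLsizei)vh);
    }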
UPDATE 27/07: That's done, NVIDIA released its OpenGL 4.1 drivers! Everything there.
OpenGL 4.0+ ABuffer V2.0: Linked lists of fragment pages

Original basic approach
To try to solve this problem, I implemented a variant of the recent OIT method presented at GDC 2010 by AMD, which uses per-pixel linked lists. The main difference in my implementation is that fragments are not stored and linked individually, but in small pages of fragments (containing 4-6 fragments). Those pages are stored and allocated in a shared pool whose size is changed dynamically depending on the scene demands.
Using pages increases cache coherency when accessing the fragments, improves the efficiency of concurrent access to the shared pool, and decreases the storage cost of the links. This comes at the cost of a slight over-allocation of fragments.
The shared pool is composed of a fragment buffer, where fragment data is stored, and a link buffer storing the links between the pages, which are reverse-chained. Each pixel of the screen contains the index of the last page it references, as well as a counter with the total number of fragments stored for that pixel (incremented using atomics).
Access to the shared pool is managed through a global page counter, incremented using an atomic operation each time a page is needed by a fragment. The allocation of a page is done by a fragment when it detects that the current page is full, or that there is no page yet for the pixel. This is done inside a critical section, to ensure that multiple fragments in flight in the pipeline and falling into the same pixel are handled correctly (a sketch of this allocation logic follows below).
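The demo implements this in GLSL fragment shaders (EXT_shader_image_load_store); the following CUDA-flavored sketch of the per-fragment allocation logic is only meant to make the description above concrete. All names are hypothetical, and the per-pixel critical section is reduced to a comment:

    #define PAGE_SIZE 4  // fragments per page

    __device__ unsigned int globalPageCounter;  // next free page in the shared pool

    __device__ void storeFragment(int pixel, float4 frag,
                                  unsigned int* pixelPageHead,   // last page per pixel
                                  unsigned int* pixelFragCount,  // fragment count per pixel
                                  unsigned int* pageLinks,       // reverse links between pages
                                  float4* fragmentPool) {
        // NOTE: the real implementation wraps this in a per-pixel critical
        // section so that concurrent fragments falling into the same pixel
        // are handled correctly; that locking is omitted here for brevity.
        unsigned int count = atomicAdd(&pixelFragCount[pixel], 1u);
        unsigned int offsetInPage = count % PAGE_SIZE;
        if (offsetInPage == 0u) {
            // Current page full (or no page yet): grab a new page from the
            // shared pool and chain it to the previous head of the list.
            unsigned int newPage = atomicAdd(&globalPageCounter, 1u);
            pageLinks[newPage] = pixelPageHead[pixel];  // reverse chaining
            pixelPageHead[pixel] = newPage;
            fragmentPool[newPage * PAGE_SIZE] = frag;
        } else {
            unsigned int page = pixelPageHead[pixel];
            fragmentPool[page * PAGE_SIZE + offsetInPage] = frag;
        }
    }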
ABuffer memory occupancy differences:
Some memory occupancy examples of the fragment storage depending on screen resolution (Basic vs Linked Lists):
- 512x512: 64MB vs 6.5MB
- 768x708: 132.7MB vs 11.7MB
- 1680x988: 405MB vs 27.42MB
- 'x' : Switch between ABuffer Algorithms (V1.0 Basic and V2.0 Linked List)
- 'n' : Display the number of fragments per pixel.
- 'g' : Switch between Alpha-Blending and Gelly resolve modes.
UPDATE 28/10/2010: Oscarbg made a port of the demo intended to run on AMD (mainly removing everything related to shader_load/store); more info there:
http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=285258#Post285258
But sadly it still does not work on AMD, so if an AMD guy reads this, your help is welcome!
I can't try it myself since I don't have any AMD card :-(
CUDA 3.1 Final out
Check it there: http://developer.nvidia.com/object/cuda_3_1_downloads.html
I did not find any new features since the 3.1 beta, just bug fixes. For instance, the 2D surface write intrinsics that were buggy in the 3.1 beta are now working.
There is also a reference to GPUDirect in the "release highlights", but I did not find any documentation for it. GPUDirect seems to allow third-party devices to do direct memory access to the GPU memory (cf. the announcement here).
CUDA dynamic template parameters (2/2) : Emulation with metaprogramming
Following my last post, I will describe how template metaprogramming can be used to "emulate" dynamic kernel template parameters. This approach does not prevent the explosion of the amount of generated code at compile time, but it provides a simple interface to "transform" runtime variable values into kernel template parameters.
If you are not familiar with template metaprogramming, I suggest you read Andrei Alexandrescu's great book, Modern C++ Design: Generic Programming and Design Patterns Applied. This book comes with a small and simple metaprogramming library called Loki, which is just a series of helper headers. My code relies on a few structures from it.
Code description
Here is what a kernel call looks like with this mechanism:
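(The original listing is not reproduced here; the following is a minimal sketch reconstructed from the description below. The kernel name, functor arguments, and the use of an explicit std::vector instead of the vararg plumbing are all illustrative.)

    // Two runtime booleans are turned into the two compile-time template
    // parameters of myKernel<...> by the KernelCaller defined further below.
    std::vector<int> runtimeArgs;
    runtimeArgs.push_back(useFeatureA ? 1 : 0);
    runtimeArgs.push_back(useFeatureB ? 1 : 0);
    KernelCaller<2>::call(MyKernelCallFunctor(grid, block, d_data), runtimeArgs);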
The actual kernel call is encapsulated into a functor (here MyKernelCallFunctor) that is passed, with the runtime variables, to the call function of what I called the KernelCaller. The KernelCaller "transforms" the set of runtime variables passed to the call function into a set of compile-time parameters, passed to the functor in the form of a typelist.
The functor is in charge of making the actual kernel call by extracting the template values from the typelist, and by passing to the __global__ function the arguments previously passed through its constructor and stored into the structure. It has to be written for each kernel you want to launch:
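(Again a minimal reconstruction, assuming Loki's typelists and Int2Type; myKernel and its float* argument are illustrative.)

    #include <vector>
    #include <loki/Typelist.h>
    #include <loki/TypeManip.h>

    template <bool P0, bool P1>
    __global__ void myKernel(float* data) {
        // Illustrative body: the booleans select code paths at compile time.
        if (P0) data[threadIdx.x] += 1.0f;
        if (P1) data[threadIdx.x] *= 2.0f;
    }

    struct MyKernelCallFunctor {
        dim3 grid, block;
        float* data;  // arguments forwarded to the __global__ function

        MyKernelCallFunctor(dim3 g, dim3 b, float* d)
            : grid(g), block(b), data(d) {}

        // Invoked by the KernelCaller once the typelist of compile-time
        // values has been built; extracts them and launches the kernel.
        template <class TList>
        void operator()(TList) const {
            myKernel< (bool)Loki::TL::TypeAt<TList, 0>::Result::value,
                      (bool)Loki::TL::TypeAt<TList, 1>::Result::value >
                    <<<grid, block>>>(data);
        }
    };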
The KernelCaller is the core of the mechanism. It operates by recursively adding static values into the typelist by testing the value of the runtime variables. In reality, the compiler will generate code for all the branches, and so will instantiate all possible sets of compile-time parameters as typelists, the right one being selected at runtime. To ease usage, the runtime variables are passed using a variable argument list and transformed into a vector containing the values.
The compile-time recursion is stopped thanks to the number of arguments, passed as a template parameter to the KernelCaller:
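(Sketch of the boolean-only KernelCaller described above: it peels one runtime value per recursion level, appends Loki::Int2Type<0> or Int2Type<1> to the typelist, and fires the functor when no arguments are left.)

    template <int NumArgs, class TList = Loki::NullType>
    struct KernelCaller {
        template <class Functor>
        static void call(const Functor& functor, const std::vector<int>& v) {
            // Next runtime value to convert into a compile-time constant.
            if (v[v.size() - NumArgs])
                KernelCaller<NumArgs - 1, typename Loki::TL::Append<
                    TList, Loki::Int2Type<1> >::Result>::call(functor, v);
            else
                KernelCaller<NumArgs - 1, typename Loki::TL::Append<
                    TList, Loki::Int2Type<0> >::Result>::call(functor, v);
        }
    };

    // Recursion terminator: all values converted, invoke the functor with
    // the completed typelist.
    template <class TList>
    struct KernelCaller<0, TList> {
        template <class Functor>
        static void call(const Functor& functor, const std::vector<int>&) {
            functor(TList());
        }
    };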
Here it is: as is, we already have a working mechanism able to deal with an arbitrary number of boolean parameters!
But this is not much fun to write, and not very scalable. So once again, we can rely on templates to generate the code for us!
In this case here is what the KernelCaller::call function looks like:
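(Sketch of the generalized KernelCaller, which replaces the boolean-only version above: it now carries a MaxArgValue bound and delegates the runtime-value test to a recursive ValueSelector, defined next.)

    // Forward declaration; the definition follows below.
    template <int NumArgs, int MaxArgValue, class TList, int Cand>
    struct ValueSelector;

    template <int NumArgs, int MaxArgValue, class TList = Loki::NullType>
    struct KernelCaller {
        template <class Functor>
        static void call(const Functor& functor, const std::vector<int>& v) {
            // Scan candidate values MaxArgValue..0 for the next runtime value.
            ValueSelector<NumArgs, MaxArgValue, TList, MaxArgValue>::select(
                functor, v, v[v.size() - NumArgs]);
        }
    };

    // Terminator: every runtime value has been converted.
    template <int MaxArgValue, class TList>
    struct KernelCaller<0, MaxArgValue, TList> {
        template <class Functor>
        static void call(const Functor& functor, const std::vector<int>&) {
            functor(TList());
        }
    };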
And here is the ValueSelector mechanism:
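(Sketch of the ValueSelector: it compares the runtime value against each candidate from MaxArgValue down to 0; on a match it appends the corresponding Int2Type to the typelist and recurses into the KernelCaller for the remaining arguments. Note that MaxArgValue is carried along, as per the bug fix mentioned in the update below.)

    template <int NumArgs, int MaxArgValue, class TList, int Cand>
    struct ValueSelector {
        template <class Functor>
        static void select(const Functor& functor,
                           const std::vector<int>& v, int value) {
            if (value == Cand)
                KernelCaller<NumArgs - 1, MaxArgValue, typename Loki::TL::Append<
                    TList, Loki::Int2Type<Cand> >::Result>::call(functor, v);
            else  // keep scanning smaller candidate values
                ValueSelector<NumArgs, MaxArgValue, TList, Cand - 1>::select(
                    functor, v, value);
        }
    };

    // Candidate scan terminator: only value 0 remains possible.
    template <int NumArgs, int MaxArgValue, class TList>
    struct ValueSelector<NumArgs, MaxArgValue, TList, 0> {
        template <class Functor>
        static void select(const Functor& functor,
                           const std::vector<int>& v, int) {
            KernelCaller<NumArgs - 1, MaxArgValue, typename Loki::TL::Append<
                TList, Loki::Int2Type<0> >::Result>::call(functor, v);
        }
    };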
Have fun, and... sorry for this crazy long post !
[Update June 21]
Code updated: There were two bugs in the code:
- Using a reference as the first parameter of a vararg is forbidden by C++ and resulted in a crash on 32-bit machines -> transformed into a pointer.
- The MaxArgValue argument to the KernelCaller was not kept by the ValueSelector, resulting in the use of the default value of 10 regardless of what the user specified, so it was very slow to compile even for boolean values.
I also added a #pragma message in order to follow the recursive compilation.
Dependency on cutil_inline.h removed.
CUDA dynamic template parameters (1/2) : My dream CUDA feature
One common problem when using templates to parametrize CUDA kernels (cf. my previous post) is being able to dynamically select the set of template parameters used for a call, depending on runtime variables. This usually leads to an exploding number of cascaded ifs/switches, and a lot of code copy/paste, to instantiate the whole parameter tree for each kernel call.
This situation is illustrated by the following code for boolean parameters:
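(A minimal sketch of what such cascaded branches look like for two boolean parameters; kernel and variable names are illustrative.)

    // Every combination has to be spelled out by hand, and every branch
    // instantiates yet another version of the same kernel.
    if (useFeatureA) {
        if (useFeatureB) myKernel<true,  true ><<<grid, block>>>(d_data);
        else             myKernel<true,  false><<<grid, block>>>(d_data);
    } else {
        if (useFeatureB) myKernel<false, true ><<<grid, block>>>(d_data);
        else             myKernel<false, false><<<grid, block>>>(d_data);
    }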
In addition to the pain it is to write, such code results in the compilation of an exploding number of versions of the same kernel, one for each instantiated template configuration.
This feature would be implemented by taking advantage of C-level JIT (Just-In-Time) kernel compilation (the current CUDA JIT compiler operates at the PTX level). It implies recompiling the kernel at runtime with a new set of template parameters each time a value changes. It requires tracking the last value of each parameter so that recompilation happens only when necessary. To be a bit more efficient, the generated code could also be cached in some way so that it can be reused.
This would change the kernel compilation paradigm to something closer to the OpenCL compiling model, but while keeping the nice CUDA-C syntax provided by nvcc.
That feature would be very useful, and it would be great if NVIDIA made CUDA evolve in that direction, or if someone wrote a JIT CUDA-C compiler that allows it!
More details coming in the next post !
Fast and Accurate Single-Pass A-Buffer using OpenGL 4.0+
One of the first things I wanted to try on the GF100 was the new NVIDIA extensions that allow random-access reads/writes and atomic operations on global memory and textures, in order to implement a fast A-Buffer!
It worked pretty well, since it provides something like a 1.5x speedup over the fastest previous approach (at least that I know about!), with zero artifacts, and it supports an arbitrary number of layers with a single geometry pass.
Sample application sources and Win32 executable:
Sources+executable+Stanford Dragon model
Additional models
Be aware that this will probably only run on a Fermi card (Forceware drivers >=257.15). In particular it requires: EXT_shader_image_load_store, NV_shader_buffer_load, NV_shader_buffer_store, EXT_direct_state_access.
The application uses freeglut in order to initialize an OpenGL 4.0 context with the core profile.
Keys:
- 'a' Enable/Disable A-Buffer
- 's' Enable/Disable fragment sorting. Disabled = closest fragment kept during resolve.
- 'g' Switch between Alpha-Blending and Gelly resolve modes.
- 'c' Enable/Disable alpha correction when in Alpha-Blending mode.
- 't' Switch between using textures or global memory for A-Buffer storage.
- '1'-'3' Change mesh (requires the additional models).
A-Buffer:
Basically, an A-Buffer is a simple list of fragments per pixel [Carpenter 1984]. Previous methods to implement it on DX10-generation hardware required multiple passes to capture an interesting number of fragments per pixel. They were essentially based on depth peeling, with enhancements allowing the capture of more than one layer per geometry pass, like the k-buffer or the stencil-routed k-buffer. Bucket-sort depth peeling allows capturing up to 32 fragments per geometry pass, but with only 32 bits per fragment (just a depth) and at the cost of potential collisions. All these techniques were complex, and especially limited by the maximum of 8 render targets writable by the fragment shader.
This technique can handle an arbitrary number of fragments per pixel in a single pass, the only limitation being the available video memory. In this example, I do order-independent transparency with fragments storing 4x32-bit values containing the RGB color components and the depth.
Technique:
The idea is very simple: each fragment is written by the fragment shader at its position into a pre-allocated 2D texture array (or a global memory region) with a fixed maximum number of layers. The layer to write the fragment into is given by a counter, stored per pixel in another 2D texture and incremented using an atomic increment (or addition) operation ([image]AtomicIncWrap or [image]AtomicAdd). After the rendering pass, the A-Buffer contains an unordered list of fragments per pixel, together with its size (a sketch of the capture step follows below). To sort these fragments by depth and compose them on the screen, I simply use a single screen-filling quad with a fragment shader. This shader copies all the pixel's fragments into a local array (probably stored in L1 on Fermi), sorts them with a naive bubble sort, and then combines them front-to-back based on transparency.
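(The demo does this in GLSL with EXT_shader_image_load_store; the following CUDA-flavored sketch of the capture step is only meant to make the description concrete, and all names are hypothetical.)

    // abuffer is laid out as [pixel * maxLayers + layer]; counters holds one
    // fragment count per pixel.
    __device__ void captureFragment(int pixel, float4 frag,
                                    unsigned int* counters,
                                    float4* abuffer, int maxLayers) {
        // Atomically grab the next free layer for this pixel.
        unsigned int layer = atomicAdd(&counters[pixel], 1u);
        if (layer < (unsigned int)maxLayers)
            abuffer[pixel * maxLayers + layer] = frag;
        // The resolve pass then copies the per-pixel list into a local
        // array, bubble-sorts it by depth, and blends front-to-back.
    }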
Performances:
To compare performance, this sample also features a standard rasterization mode which renders directly into the color buffer. On the Stanford Dragon example, with a GTX 480 and 32 layers in the A-Buffer, the technique runs at 400-500 FPS and is only 5-20% more costly than a simple rasterization of the mesh.
I also compared performance with the k-buffer, whose code is available online (still, be careful: it may not be super optimized). On the GTX 480, with the same model and shading (and 16 layers), I get more than a 2x speedup. Based on these results, I strongly believe that it is also close to 1.5x faster than bucket-sort depth peeling, without its depth collision problems.
EDIT: The artifacts in the stencil-routed k-buffer came from a bug in DXUT; images removed. Also added a warning about the performance of the k-buffer OpenGL code from Louis Bavoil's page.
EDIT 2: The follow-up of this work using per-pixel linked-lists can also be read there: http://blog.icare3d.org/2010/07/opengl-40-abuffer-v20-linked-lists-of.html