Among the interesting things: shader_atomic_counters, conservative depth, instanced transform feedback, the integration of shader_image_load_store and more!
And congrats to Christophe for his first credit in the spec :-D
I started to compile links to websites where free 3D models can be found. If you know other good websites, feel free to post them in the comments :-)
Static models and scenes:
Full story here :-D
GDC 2011 is over now and presentations start to appear on-line :-)
DICE: http://publications.dice.se/
NVIDIA: http://www.nvidia.com/object/gdc2011.html
AMD: http://developer.amd.com/documentation/presentations/Pages/default.aspx#GDC
INTEL: http://software.intel.com/en-us/articles/intelgdc2011/
Khronos on OpenGL: http://www.khronos.org/library/detail/2011-gdc-opengl
More links can be found on this blog: http://msinilo.pl/blog/?p=724
This is the season of new game engines: DICE's Frostbite 2, Crytek's CryEngine 3 and Epic's Unreal Engine 3!
Here it is: CUDA 4.0 RC has just been released to NVIDIA registered developers.
Interesting stuff from the CUDA manual:
Today, I would like to invite you to follow an awesome new collaborative game development blog called #AltDevBlogADay. The blog was founded in January by Mike Acton, engine director at Insomniac Games, and brings together many very good game developers who publish one article each day on the website.
Here is a short list of the articles I particularly liked among those already published, and that I advise you to take a look at:
I rediscovered a very good in-depth explanation of how GPUs work, published in 2008 on the c0de517e blog:
Part 1 Part 2 Part 3
A friend pointed me to this very interesting talk from NVIDIA GTC:
Better Performance at Lower Occupancy
It debunks two common fallacies that CUDA developers usually believe in:
Download it on the NVIDIA Developer website!
http://developer.nvidia.com/object/cuda_3_2_downloads.html
A very interesting article about texture and buffer access performance in OpenGL on the AMD Evergreen architecture:
http://rastergrid.com/blog/2010/11/texture-and-buffer-access-performance/
I did not have much time to update this blog lately, so here is some interesting stuff I did not post during October:
An awesome post from Christophe Riccio speculating about the next OpenGL releases:
http://www.g-truc.net/post-tech-lastest-full.html#post0330
Here is the summary:
Available to registered developers.
Here is the interesting new stuff I found:
Yesterday NVIDIA released an official disassembler for the sm_1.x (pre-Fermi) real hardware ISA. It's like an official version of DECUDA :-) (which Wladimir stopped developing)
It takes either an ELF CUDA binary, a cubin or even an exe file, and produces the low-level assembly code of the CUDA kernels.
It is only available to registered developers for now, but you can get a little more information on the CUDA forum.
That's something a lot of developers have been asking for for a while. It allows seeing the impact of optimizations on the real microcode, which is particularly important for register usage for instance (since register allocation is done after the PTX level).
Nice to see NVIDIA finally unveiling its real hardware ISA instructions. AMD is still a little ahead on this, since the ISA instructions and microcode are available even for the Evergreen architecture (RV870): http://developer.amd.com/gpu/ATIStreamSDK/assets/AMD_Evergreen-Family_ISA_Instructions_and_Microcode.pdf
What EXACTLY is a PhD? The answer is here:
http://gizmodo.com/5613794/what-is-exactly-a-doctorate
"Keep pushing" !
OpenGL 4.1 Review: http://www.g-truc.net/post-tech-lastest-full.html
Christophe also published an updated version of his sample pack, with a nice table showing how the new stuff works on both NVIDIA and ATI (Beta drivers): http://www.g-truc.net/post-tech-lastest-full.html#post0321
UPDATE: Another good article about OpenGL 4.1: http://rastergrid.com/blog/2010/08/an-introduction-to-opengl-4-1/
The Direct3D API is a fully object-oriented C++ API that relies on runtime polymorphism (virtual function calls) to be extensible and easily able to provide different implementations. So all API calls are virtual calls instead of plain C calls like in OpenGL.
Every slightly experienced C++ developer knows that virtual function calls introduce overhead and should be avoided inside inner loops. Humus shows how these virtual calls can be replaced by standard calls by hacking the API objects' v-table in order to keep a plain C pointer to these virtual methods!
http://www.humus.name/index.php?page=Comments&ID=321
I love this kind of hack! But as Humus explains, D3D (like OpenGL since OpenGL 3.0) does not rely on immediate mode anymore, which means that API calls usually consume a fairly large number of cycles compared to the overhead of a virtual call.
That means that in practice you won't get a significant performance gain from this hack, but it is just really cool :-D And this method could still be useful to overcome performance problems in more badly designed APIs!
UPDATE: D3D v-table hacking... made useful!
Humus just published another trick that shows how hacking the v-table of the D3D context can be used to... replace the default API calls with your own enhanced calls!
Humus shows how this can be useful, for instance to count the number of times an API function is called. This is done by overwriting the original object's v-table pointer with the address of your own v-table. More details here:
http://www.humus.name/index.php?page=Comments&ID=322
NVIDIA Siggraph 2010 presentations are available for streaming here.
Check it there: http://developer.nvidia.com/object/cuda_3_1_downloads.html
I did not find new features since 3.1 Beta, just bug fixes. For instance, the 2D surface write intrinsics that were buggy in 3.1 Beta are now working.
There is also a reference to GPUDirect in the "release highlights", but I did not find any documentation for it. GPUDirect seems to allow third-party devices to do direct memory access to the GPU memory (cf. the announcement here).
Following up on my last post, I will describe how template metaprogramming can be used to "emulate" dynamic kernel template parameters. This approach does not prevent the explosion of the amount of code generated at compile time, but it provides a simple interface to "transform" runtime variable values into kernel template parameters.
If you are not familiar with template metaprogramming, I suggest you read Andrei Alexandrescu's great book: Modern C++ Design: Generic Programming and Design Patterns Applied. This book comes with a small and simple metaprogramming library called Loki that is just a series of helper headers. My code relies on a few structures from it.
Code description
Here is what a kernel call looks like with this mechanism:
The actual kernel call is encapsulated into a functor (here MyKernelCallFunctor) that is passed, with the runtime variables, to the call function of what I called the KernelCaller. The KernelCaller "transforms" the set of runtime variables passed to the call function into a set of compile-time parameters passed to the functor in the form of a typelist.
The functor is in charge of making the actual kernel call by extracting the template values from the typelist, and by passing to the __global__ function the arguments previously passed through its constructor and stored in the structure. It has to be written for each kernel you want to launch.
The KernelCaller is the core of the mechanism. It operates by recursively adding static values into the typelist by testing the value of the runtime variables. In reality the compiler will generate code for all the branches, and so will instantiate all possible sets of compile-time parameters as typelists, the right one being selected at runtime. To ease usage, the runtime variables are passed using a variable argument list and transformed into a vector containing the values.
The compile-time recursion is stopped thanks to the number of remaining arguments, which is passed as a template parameter to the KernelCaller.
And here it is: we already have a working mechanism able to deal with an arbitrary number of boolean parameters!
One common problem when using templates to parametrize CUDA kernels (cf. my previous post) is being able to dynamically select the set of template parameters to use for a call, depending on runtime variables. This usually leads to an exploding number of cascaded if/switch statements, and a lot of code copy/paste to instantiate the whole parameter tree for each kernel call.
This situation is illustrated by the following code for boolean parameters:
In addition to the pain it is to write, such code results in the compilation of an exploding number of versions of the same kernel, one for each instantiated template configuration.
One of the first things I wanted to try on the GF100 was the new NVIDIA extensions that allow random access reads/writes and atomic operations in global memory and textures, in order to implement a fast A-Buffer!
It worked pretty well: it provides something like a 1.5x speedup over the fastest previous approach (at least that I know about!), with zero artifacts, and supports an arbitrary number of layers with a single geometry pass.
On Friday I received a GeForce GTX 480 sent by NVIDIA to the lab !
It will nicely replace my dusty GTX 280 :-D
A guy from Los Alamos compared the performance (between Tesla 2 and Fermi) of output queues using atomic-add on an integer index per queue. First result: a 16x speedup on Fermi!
http://forums.nvidia.com/index.php?showtopic=170125
This is supposedly thanks to the coalescing of atomic operations that may be done in the L2 cache.
He also did another experiment to see if the L2 cache allows combining writes from different blocks to global memory, and it appears to be the case when consecutive blocks write to the same cache line at the same time. Result: a 3.25x speedup on Fermi.
http://forums.nvidia.com/index.php?showtopic=170127
I always knew that day would come!
Source: gpgpu.org

The June 2010 Top500 list of the world’s fastest supercomputers was released this week at ISC 2010. While the US Jaguar supercomputer (located at the Department of Energy’s Oak Ridge Leadership Computing Facility) retained the top spot in Linpack performance, a Chinese cluster called Nebulae, built from a Dawning TC3600 Blade system with Intel X5650 processors and NVIDIA Tesla C2050 GPUs, is now the fastest in theoretical peak performance at 2.98 PFlop/s and No. 2 with a Linpack performance of 1.271 PFlop/s. This is the highest rank a GPU-accelerated system, or a Chinese system, has ever achieved on the Top500 list.
For more information, visit www.TOP500.org.
After Intel officially admitted (through this blog post; there is also an interesting article here) that Larrabee is not going to play in the consumer gaming market in the "near future", BsN published an interesting post-mortem article by Andrew Richards: Why Intel Larrabee Really Stumbled: Developer Analysis
At a given point in your PhD, even trees start looking like the Stanford Bunny!
Just discovered a good review of OpenGL 4.0 by Christophe Riccio "Groovounet"; I missed it in March.
http://www.g-truc.net/post-0269.html
NVIDIA released a beta version of the CUDA 3.1 toolkit for registered developers.
New features from the programming guide: