A friend pointed me to this very interesting talk from NVIDIA GTC:
Better Performance at Lower Occupancy
It debunks two common fallacies that CUDA developers usually believe in:
- Multithreading is the only way to hide latency on GPU
- Shared memory is as fast as registers
All the GTC2010 presentations can be found there (with slides and videos !):
PS: It is funny to see what was possible at this time, but it was developed quickly and the shader code itself is not a reference !
Download it on NVIDIA Developer website !
In this article, I also discovered "Pomegranate" [Eldridge et al., 2000], a parallel hardware architecture for polygon rendering developed at Stanford, which seems to be very close to the way Fermi handles the parallel work distribution of the different steps of the graphics pipeline.
Discussions are on Beyond3D Forum.
Here are some interesting statements:
Read more »
A very interesting article about texture and buffer access performance in OpenGL on the AMD Evergreen architecture:
I did not have much time to update this blog lately, so here are some interesting stuff I did not post during October:
- GPU-Assisted Malware : http://www.ics.forth.gr/dcs/Activities/papers/gpumalware.malware10.pdf
- Thrust 1.3 released : http://gpgpu.org/2010/10/07/thrust-v1-3-release
- OpenGL 4.1 drivers status : g-truc creation
- "Can CPUs Match GPUs on Performance with Productivity ?" : IBM Research
- GPU Technology Conference Session Video Archive : NVIDIA
- EASTL : An implementation of the C++ STL made by EA and optimized for video games usages
So we now publicly know that Johannes Kepler and James Clerk Maxwell are the two next scientists that will succeed Enrico Fermi !
An awesome post from Christophe Riccio about his speculation for next OpenGL releases:
Here is the summary:
Available to registered developers.
Here are the interesting new stuff I found:
- Support for malloc() and free() in kernels: dynamic global memory allocation ! This is implemented with a new syscall linking mechanism that seems to allow kernels to be linked to precompiled system calls. Info on the linking mechanism (.calltargets, .callprototype) can be found in section 10.3 of the PTX ISA manual. I hope this mechanism will get exposed for user functions in the API !
- 64-bit addressing support in the CUDA driver API: allows manipulating more than 4GB of device memory.
- New System Management Interface (nvidia-smi) for reporting various hardware counter information
- New stream synchronization function cudaStreamSynchronize(): allows GPU-side inter-stream synchronization.
- A set of new calls is available to allow the creation of CUDA devices with interoperability with Direct3D devices that use SLI in AFR (Alternate Frame Rendering)
- New flag to driver API texture reference (CU_TRSF_SRGB), which enables sRGB->linear conversion on a read.
- Reference manual adds architecture information on GF10x (GF104, GF106, GF108) class hardware (compute capability 2.1)
- Added the tld4 (fetch4) instruction for loading a component (r, g, b, or a) from the four texels comprising the bilinear interpolation footprint of a given texture location.
- Added the ability to specify the state space and alignment of the memory pointed to by kernel pointer parameters.
New CUDA Libraries
- CUSPARSE, supporting sparse matrix computations.
- CURAND, supporting random number generation for both host and device code, with Sobol quasi-random and XORWOW pseudo-random routines.
Yesterday NVIDIA released an official disassembler for the sm_1.x (pre-Fermi) real hardware ISA. It's like an official version of DECUDA :-) (which Wladimir stopped developing)
It takes either an ELF CUDA binary, a cubin or even an exe file, and provides the low level assembly code of the CUDA kernels.
It is only available to registered developers for now, but you can get a little more information on the CUDA forum.
That's something a lot of developers have been asking for for a while. It allows seeing the impact of optimizations on the real microcode, which is particularly important for register usage for instance (since register allocation is done after the PTX level).
It is nice that NVIDIA finally ended up unveiling its real hardware ISA instructions. AMD is still a little bit ahead on this, since the ISA instructions and microcode are available even for the Evergreen architecture (RV870): http://developer.amd.com/gpu/ATIStreamSDK/assets/AMD_Evergreen-Family_ISA_Instructions_and_Microcode.pdf
What is EXACTLY a PhD ? The answer is there :
"Keep pushing" !
OpenGL 4.1 Review: http://www.g-truc.net/post-tech-lastest-full.html
Christophe also published an updated version of his sample pack, with a nice table showing how the new stuff works on both NVIDIA and ATI (Beta drivers): http://www.g-truc.net/post-tech-lastest-full.html#post0321
UPDATE: Another good article about OpenGL 4.1: http://rastergrid.com/blog/2010/08/an-introduction-to-opengl-4-1/
The Direct3D API is a fully C++ object-oriented API, and it relies on runtime polymorphism (virtual function calls) to be extendable and easily able to provide different implementations. So all API calls are virtual calls, instead of plain C calls like in OpenGL.
Every slightly experienced C++ developer knows that virtual function calls introduce overhead and that they should be avoided inside inner loops. Humus shows how these virtual calls can be replaced by standard calls by hacking the API objects' v-table, in order to keep a plain C pointer to these virtual methods !
I love this kind of hack ! But as Humus explains, D3D (like OpenGL since OpenGL 3.0) does not rely on immediate mode anymore, which means that API calls usually consume a fairly large number of cycles compared to the overhead of a virtual call.
That means that in practice you won't get a significant performance gain from this hack, but it is just really cool :-D And this method could still be useful to overcome performance problems in more badly designed APIs !
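As a toy illustration of the trick (my own reconstruction, not Humus's actual D3D code, and relying on the common Itanium/MSVC v-table layout, which is NOT guaranteed by the C++ standard):

```cpp
// Toy stand-in for a COM-style interface: D3D device methods are reached
// through a v-table just like the one this virtual class generates.
struct IDevice {
    virtual int Draw(int count) { return count * 2; }
    virtual ~IDevice() {}
};

// On common ABIs, the first pointer-sized word of a polymorphic object is
// its v-table pointer, and each entry is a plain function pointer taking
// `this` as a hidden first argument. Caching an entry turns every later
// virtual call into a direct call.
typedef int (*DrawFn)(IDevice*, int);

DrawFn cacheDrawPointer(IDevice* dev) {
    void** vtable = *reinterpret_cast<void***>(dev); // first word: v-table
    return reinterpret_cast<DrawFn>(vtable[0]);      // first entry: Draw
}
```

Once the pointer is cached, `draw(&dev, n)` goes straight to the method with no virtual dispatch.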
UPDATE : D3D v-table hacking... made useful !
Humus just published another trick that shows how hacking the v-table of the D3D context can be used to... replace the default API calls by your own enhanced calls !
Humus shows how this can be useful to count the number of times an API function is called, for instance. This can be done by overwriting the original object's v-table pointer with the address of your own v-table. More details there :
NVIDIA Siggraph 2010 presentations available in streaming there.
- Among them a very interesting presentation of OpenGL 3.3/4.0/4.1 by Barthold Lichtenbelt and Mark Kilgard.
Siggraph 2010 Khronos OpenGL BOF and OpenCL BOF slides available.
Reference pages for OpenGL 3.3 and OpenGL 4.1 are online on opengl.org !
- I already said it, but I love the way OpenGL has been evolving since OpenGL 3.0 ! There really seems to be a real will from the vendors to make it a first-class, innovative API again :-)
OptiX 2.0, and Cg Toolkit 3.0 released by NVIDIA
- SM 5 support in Cg at last !
- I tried OptiX (previously NVIRT) recently and I was really impressed, especially by the ease of use of the "high level" optixu interface. That's really an awesome tool.
It does not bring a lot of new features, but it's still great to see OpenGL quickly evolving ! Direct State Access does not get into the core yet (sorry Christophe ;-), and I am not sure we will get it before OpenGL 5.0...
As usual, NVIDIA is very likely to announce the release of drivers supporting OpenGL 4.1 during the OpenGL BOF :-)
OpenGL.org forum official thread here.
Here are the main new features:
- Viewport Array (ARB_viewport_array). This is, for me, the most interesting new feature. It allows manipulating multiple viewports inside a given render call. Viewports control the behavior of the "viewport transformation" stage (view space -> window coordinates, scissor test). Multiple viewports can be created, and the geometry shader can direct emitted primitives to a selected viewport. A separate viewport rectangle and scissor region can be specified for each viewport.
- Ability to get the binary representation of a program object (ARB_get_program_binary). This is a long-awaited feature present in DX for a while.
- Separate shader objects (ARB_separate_shader_objects). It allows compiling and linking a separate program for each shader stage (VS/TCS/TES/GS/FS). A Program Pipeline Object is introduced to manipulate and bind the separate programs. That's also a useful feature, and it was the way things were done in Cg.
- Improved compatibility with OpenGL ES 2.0 (ARB_ES2_compatibility). Adds a few missing functions and tokens.
- Support for 64-bit vertex attributes in GLSL (ARB_vertex_attrib_64bit).
- Increases required size for textures/renderbuffers.
- ARB_debug_output: Callback mechanisms to receive enhanced errors and warning messages.
- ARB_robustness: Addresses multiple specific goals to improve robustness, for example when running WebGL applications. For instance, it provides additional "safe" APIs that bound the amount of data returned by an API query.
- ARB_shader_stencil_export: Ability to set stencil values in a fragment shader for enhanced rendering flexibility :-)
- ARB_cl_event: Link OpenGL sync objects to OpenCL event objects for enhanced OpenCL interoperability.
UPDATE 27/07: That's done, NVIDIA released its OpenGL 4.1 drivers ! Everything there.
Original basic approach
To try to solve this problem, I implemented a variant of the recent OIT method presented at GDC 2010 by AMD, which uses per-pixel linked lists. The main difference in my implementation is that fragments are not stored and linked individually, but into small pages of fragments (containing 4-6 fragments). Those pages are stored and allocated in a shared pool whose size is changed dynamically depending on the scene demands.
Using pages increases the cache coherency when accessing the fragments, improves the efficiency of concurrent accesses to the shared pool, and decreases the storage cost of the links. This comes at the cost of a slight over-allocation of fragments.
The shared pool is composed of a fragment buffer where fragment data is stored, and a link buffer storing links between the pages, which are reverse-chained. Each pixel of the screen contains the index of the last page it references, as well as a counter with the total number of fragments stored for that pixel (incremented using atomics).
Access to the shared pool is managed through a global page counter, incremented using an atomic operation each time a page is needed by a fragment. The allocation of a page is done by a fragment when it detects that the current page is full, or when there is no page yet for the pixel. This is done inside a critical section, to ensure that multiple fragments in flight in the pipeline and falling into the same pixel will be handled correctly.
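To make the allocation scheme concrete, here is a CPU-side sketch of the paged pool (all names and the serialized logic are my own; the GPU version does the same thing per fragment with atomicAdd and a per-pixel critical section):

```cpp
#include <atomic>
#include <vector>

static const int PAGE_SIZE = 4;  // fragments per page (4-6 in the demo)
static const int EMPTY = -1;

struct Fragment { float depth, r, g, b; };

struct PagedABuffer {
    std::atomic<int> pageCounter{0};          // global counter: next free page
    std::vector<Fragment> pool;               // shared pool: pages * PAGE_SIZE
    std::vector<int> prevPage;                // link buffer: reverse chaining
    std::vector<int> lastPage;                // per pixel: last page index
    std::vector<std::atomic<int>> fragCount;  // per pixel: total fragment count

    PagedABuffer(int pixels, int pageCount)
        : pool(pageCount * PAGE_SIZE), prevPage(pageCount, EMPTY),
          lastPage(pixels, EMPTY), fragCount(pixels) {}

    // One call per incoming fragment. A fresh page is grabbed from the
    // shared pool when the current one is full or the pixel has none yet.
    void store(int pixel, const Fragment& f) {
        int count = fragCount[pixel].fetch_add(1);  // index of this fragment
        if (count % PAGE_SIZE == 0) {               // page full (or no page)
            int page = pageCounter.fetch_add(1);    // allocate from the pool
            prevPage[page] = lastPage[pixel];       // reverse-chain the pages
            lastPage[pixel] = page;
        }
        pool[lastPage[pixel] * PAGE_SIZE + count % PAGE_SIZE] = f;
    }
};
```

Note how only one atomic counter guards the whole pool: that is what makes the concurrent allocation cheap compared to one allocation per fragment.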
ABuffer memory occupancy differences:
Some memory occupancy examples of the fragments storage depending on screen resolution (Basic vs Linked Lists):
- 512x512: 64MB vs 6.5MB
- 768x708: 132.7MB vs 11.7MB
- 1680x988: 405MB vs 27.42MB
- 'x' : Switch between ABuffer Algorithms (V1.0 Basic and V2.0 Linked List)
- 'n' : Display the number of fragments per pixel.
- 'g' : Switch between Alpha-Blending and Gelly resolve modes.
UPDATE 28/10/2010: Oscarbg did a port of the demo so that it can run on AMD (mainly removing everything related to shader_load/store), more info there:
But sadly it still does not work on AMD, so if an AMD guy reads this, your help is welcome !
I can't try myself since I don't have any AMD card :-(
Check it there: http://developer.nvidia.com/object/cuda_3_1_downloads.html
I did not find any new features since the 3.1 beta, just bug fixes. For instance, the 2D surface write intrinsics that were buggy in the 3.1 beta are now working.
There is also a reference to GPUDirect in the "release highlights", but I did not find any documentation for it. GPUDirect seems to allow third-party devices to do direct memory access to the GPU memory (cf. the announcement here).
Following my last post, I will describe how template metaprogramming can be used to "emulate" dynamic kernel template parameters. This approach does not prevent the explosion of the amount of code generated at compile time, but it provides a simple interface to "transform" runtime variable values into kernel template parameters.
If you are not familiar with template metaprogramming, I suggest you read Andrei Alexandrescu's great book: Modern C++ Design: Generic Programming and Design Patterns Applied. This book comes with a small and simple metaprogramming library called Loki that is just a series of helper headers. My code relies on a few structures from it.
Here is what a kernel call looks like with this mechanism:
The actual kernel call is encapsulated into a functor (here MyKernelCallFunctor) that is passed, with the runtime variables, to the call function of what I called the KernelCaller. The KernelCaller "transforms" the set of runtime variables passed to the call function into a set of compile-time parameters passed to the functor in the form of a typelist.
The functor is in charge of making the actual kernel call, by extracting the template values from the typelist and by passing to the __global__ function the arguments previously passed through its constructor and stored into the structure. It has to be written for each kernel you want to launch:
The KernelCaller is the core of the mechanism. It operates by recursively adding static values into the typelist by testing the values of the runtime variables. In reality, the compiler will generate code for all the branches, and so will instantiate all possible sets of compile-time parameters as typelists, the right one being selected at runtime. To ease usage, the runtime variables are passed using a variable argument list and transformed into a vector containing the values.
The compile-time recursion is stopped thanks to the number of arguments passed as a template parameter to the KernelCaller:
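Here is a minimal self-contained sketch of that mechanism (my reconstruction: it uses C++11 variadic templates and a plain bool array where the original version uses Loki typelists and a vararg list, and a recorded id where the real functor would launch a __global__ kernel):

```cpp
// The functor encapsulating the kernel call: run<> receives the booleans
// as compile-time template parameters; the runtime kernel arguments were
// stored through the constructor.
struct MyKernelCallFunctor {
    int n;                      // runtime kernel argument
    mutable int launched = -1;  // records which instantiation ran (sketch only)
    explicit MyKernelCallFunctor(int n) : n(n) {}

    template <bool SORT, bool SHADE>
    void run() const {
        // Real code would do: myKernel<SORT, SHADE><<<grid, block>>>(n);
        launched = (SORT ? 2 : 0) + (SHADE ? 1 : 0);
    }
};

// KernelCaller: peels one runtime boolean per recursion level and turns it
// into a compile-time parameter. The compiler instantiates both branches;
// the runtime test selects the right one.
template <int NumParams>
struct KernelCaller {
    template <class Functor, bool... Acc>
    static void call(const Functor& f, const bool* values) {
        if (values[NumParams - 1])
            KernelCaller<NumParams - 1>::template call<Functor, true,  Acc...>(f, values);
        else
            KernelCaller<NumParams - 1>::template call<Functor, false, Acc...>(f, values);
    }
};

// The recursion stops once every runtime value has been consumed.
template <>
struct KernelCaller<0> {
    template <class Functor, bool... Acc>
    static void call(const Functor& f, const bool*) {
        f.template run<Acc...>();
    }
};
```

A call then looks like `KernelCaller<2>::call(MyKernelCallFunctor(n), values)`, and the right `run<...>` instantiation is reached without any hand-written cascade.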
Here it is: as is, we already have a working mechanism able to deal with an arbitrary number of boolean parameters !
But this is not much fun to write, and not very scalable. So once again, we can rely on templates to generate the code for us !
In this case here is what the KernelCaller::call function looks like:
And here is the ValueSelector mechanism:
Have fun, and... sorry for this crazy long post !
[Update June 21]
Code updated: there were 2 bugs in the code:
- Using a reference as the first parameter of a vararg is forbidden by C++ and resulted in a crash on 32-bit machines -> transformed into a pointer.
- The MaxArgValue argument to the KernelCaller was not kept by the ValueSelector, resulting in the use of the default value of 10 regardless of what the user specified, so it was very slow to compile even for boolean values.
I also added a #pragma message in order to follow the recursive compilation.
Dependency on cutil_inline.h removed.
One common problem when using templates to parametrize CUDA kernels (cf. my previous post) is being able to dynamically select the set of template parameters to use for a call, depending on runtime variables. This usually leads to an exploding number of cascaded if/switch statements and a lot of code copy/paste to instantiate the whole parameter tree for each kernel call.
This situation is illustrated by the following code for boolean parameters:
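Such a cascade could look like the following sketch (my reconstruction, with a plain function returning an id standing in for the __global__ kernel launch):

```cpp
// Stand-in for a CUDA kernel templated on two boolean parameters; in the
// real code this would be a __global__ function launched with <<<...>>>.
// It returns an id so the selected instantiation can be observed.
template <bool SORT, bool SHADE>
int myKernel(int n) {
    (void)n;  // the runtime kernel argument would be used here
    return (SORT ? 2 : 0) + (SHADE ? 1 : 0);
}

// The naive cascade: every combination of runtime values must be spelled
// out by hand, and the number of branches doubles with each new parameter.
int callMyKernel(bool sort, bool shade, int n) {
    if (sort) {
        if (shade) return myKernel<true,  true >(n);
        else       return myKernel<true,  false>(n);
    } else {
        if (shade) return myKernel<false, true >(n);
        else       return myKernel<false, false>(n);
    }
}
```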
In addition to the pain it is to write, such code results in the compilation of an exploding number of versions of the same kernel, one for each instantiated template configuration.
This feature would be implemented by taking advantage of a C-level JIT (Just In Time) kernel compilation (the current CUDA JIT compiler operates at the PTX level). It implies recompiling the kernel at runtime with a new set of template parameters each time a value changes. It requires tracking the last value of each parameter so that recompilation happens only when necessary. To be a bit more efficient, the generated code could also be cached in some way so that it can be reused.
This would change the kernel compilation paradigm to something closer to the OpenCL compiling model, but while keeping the nice CUDA-C syntax provided by nvcc.
That feature would be very useful, and it would be great if NVIDIA made CUDA evolve in that direction, or if someone wrote a JIT CUDA-C compiler that allows that !
More details coming in the next post !
One of the first things I wanted to try on the GF100 was the new NVIDIA extensions that allow random access reads/writes and atomic operations into global memory and textures, to implement a fast A-Buffer !
It worked pretty well, since it provides something like a 1.5x speedup over the fastest previous approach (at least that I know about !), with zero artifacts, and it supports an arbitrary number of layers with a single geometry pass.
Sample application sources and Win32 executable:
Sources+executable+Stanford Dragon model
Be aware that this will probably only run on a Fermi card (Forceware drivers >=257.15). In particular it requires: EXT_shader_image_load_store, NV_shader_buffer_load, NV_shader_buffer_store, EXT_direct_state_access.
Application uses freeglut in order to initialize an OpenGL 4.0 context with the core profile.
- 'a' Enable/Disable A-Buffer
- 's' Enable/Disable fragments sorting. Disable= closest fragment kept during resolve.
- 'g' Swith between Alpha-Blending and Gelly resolve modes.
- 'c' Enable/Disable alpha correction when in Alpha-Blending mode.
- 't' Swith between using textures or global memory for A-Buffer storage.
- '1'-'3' Change mesh (requires the additional models).
Basically, an A-buffer is a simple list of fragments per pixel [Carpenter 1984]. Previous methods to implement it on DX10-generation hardware required multiple passes to capture an interesting number of fragments per pixel. They were essentially based on depth peeling, with enhancements allowing the capture of more than one layer per geometric pass, like the k-buffer or the stencil-routed k-buffer. Bucket sort depth peeling allows capturing up to 32 fragments per geometry pass, but with only 32 bits per fragment (just a depth) and at the cost of potential collisions. All these techniques were complex, and especially limited by the maximum of 8 render targets that were writable by the fragment shader.
This technique can handle an arbitrary number of fragments per pixel in a single pass, with the available video memory as the only limitation. In this example, I do order-independent transparency with fragments storing 4x32-bit values containing the RGB color components and the depth.
The idea is very simple: each fragment is written by the fragment shader at its position into a pre-allocated 2D texture array (or a global memory region) with a fixed maximum number of layers. The layer to write the fragment into is given by a counter, stored per pixel into another 2D texture and incremented using an atomic increment (or addition) operation (imageAtomicIncWrap or imageAtomicAdd). After the rendering pass, the A-Buffer contains an unordered list of fragments per pixel, along with its size. To sort these fragments per depth and compose them on the screen, I simply use a single screen-filling quad with a fragment shader. This shader copies all the pixel's fragments into a local array (probably stored in L1 on Fermi), sorts them with a naive bubble sort, and then combines them front-to-back based on transparency.
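The per-pixel logic can be sketched on the CPU like this (the names and the alpha-only fragments are my own simplification of what the shaders do):

```cpp
#include <algorithm>
#include <atomic>
#include <vector>

// Simplified fragment: depth + alpha only (the demo stores RGB + depth).
struct Frag { float depth; float a; };

struct ABuffer {
    int maxLayers;
    std::vector<std::atomic<int>> counter;  // per-pixel fragment counter
    std::vector<Frag> layers;               // pixels * maxLayers slots

    ABuffer(int pixels, int maxLayers)
        : maxLayers(maxLayers), counter(pixels), layers(pixels * maxLayers) {}

    // Fragment-shader side: the atomic increment picks the destination layer.
    void store(int pixel, Frag f) {
        int layer = counter[pixel].fetch_add(1);
        if (layer < maxLayers)
            layers[pixel * maxLayers + layer] = f;
    }

    // Resolve side: copy the pixel's fragments into a local array, sort by
    // depth with a naive bubble sort, blend front-to-back, and return the
    // accumulated opacity.
    float resolveOpacity(int pixel) {
        int n = std::min(counter[pixel].load(), maxLayers);
        std::vector<Frag> local(layers.begin() + pixel * maxLayers,
                                layers.begin() + pixel * maxLayers + n);
        for (int i = 0; i < n; ++i)              // bubble sort, nearest first
            for (int j = 0; j + 1 < n - i; ++j)
                if (local[j].depth > local[j + 1].depth)
                    std::swap(local[j], local[j + 1]);
        float alpha = 0.0f;
        for (int i = 0; i < n; ++i)              // front-to-back blending
            alpha += (1.0f - alpha) * local[i].a;
        return alpha;
    }
};
```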
To compare performance, this sample also features a standard rasterization mode which renders directly into the color buffer. On the Stanford Dragon example, with a GTX 480 and 32 layers in the A-Buffer, the technique runs between 400-500 FPS, and is only 5-20% more costly than a simple rasterization of the mesh.
I also compared performance with the k-buffer, whose code is available online (still, be careful: it may not be super optimized). On the GTX 480, with the same model and shading (and 16 layers), I get more than a 2x speedup. Based on those results, I strongly believe that it is also close to 1.5x faster than bucket sort depth peeling, without its depth collision problems.
EDIT: Artifacts in stencil-routed k-buffer came from a bug in DXUT, images removed. Also added a warning about the performances of the k-buffer OpenGL code from Louis Bavoil page.
EDIT 2: The follow-up of this work using per-pixel linked-lists can also be read there: http://blog.icare3d.org/2010/07/opengl-40-abuffer-v20-linked-lists-of.html
On Friday I received a GeForce GTX 480 sent by NVIDIA to the lab !
It will nicely replace my dusty GTX 280 :-D
A guy from Los Alamos compared the performances (between Tesla 2 and Fermi) of output queues using atomic-add on an integer index per queue. First result : 16x speedup on Fermi !
This is supposedly thanks to the coalescing of atomic operations that may be done in the L2 cache.
He also did another experiment to see if the L2 cache allows combining writes from different blocks into global memory, and it appears to be the case when you have consecutive blocks writing to the same cache line at the same time. Result: 3.25x speedup on Fermi.
I always knew that day would come !
Source: gpgpu.org
The June 2010 Top500 list of the world’s fastest supercomputers was released this week at ISC 2010. While the US Jaguar supercomputer (located at the Department of Energy’s Oak Ridge Leadership Computing Facility) retained the top spot in Linpack performance, a Chinese cluster called Nebulae, built from a Dawning TC3600 Blade system with Intel X5650 processors and NVIDIA Tesla C2050 GPUs, is now the fastest in theoretical peak performance at 2.98 PFlop/s and No. 2 with a Linpack performance of 1.271 PFlop/s. This is the highest rank a GPU-accelerated system, or a Chinese system, has ever achieved on the Top500 list.
For more information, visit www.TOP500.org.
After Intel officially admitted (through this blog post; there is also an interesting article here) that Larrabee is not going to play in the consumer gaming market in the "near future", BsN published an interesting post-mortem article from Andrew Richards: Why Intel Larrabee Really Stumbled: Developer Analysis
At a given point of your PhD, even trees start looking like the Stanford Bunny !
And it's not Photoshopped (I saw it during a hike near Grenoble).
Just discovered a good review of OpenGL 4.0 made by Christophe Riccio "Groovounet", I missed it in March.
NVIDIA released a beta version of the CUDA 3.1 toolkit for registered developers.
New features from the programming guide :
- 16-bit float textures supported by the runtime API. __float2half_rn() and __half2float() intrinsics added (Table C-3).
- Surface memory interface exposed in the runtime API (Section 3.2.5, B9). Read/write access into textures (CUDA Arrays). But still limited to 1D and 2D arrays.
- Up to 16 parallel kernel launches on Fermi (it was only 4 in CUDA 3.0). Not sure how it is really implemented (one per SM ? multiple per SM ?).
- Recursive calls supported in device function on Fermi (B.1.4). Stack size query and setting functions added (cudaThreadGetLimit(), cudaThreadSetLimit()).
- Function pointers supported on device functions on Fermi (B.1.4). Function pointers to global functions supported on all GPUs.
- Just noticed that a __CUDA_ARCH__ macro, allowing writing different code paths depending on the architecture (or for code executed on the host), has been here since CUDA 3.0 (B.1.4).
- printf support in kernels, integrated into the API for sm_20 (B.14). Note that a cuprintf supporting all architectures was provided to registered developers a few months ago.
- New __byte_perm(x,y,s) intrinsic (C.2.3).
- New __forceinline__ function qualifier to force inlining on Fermi. A __noinline__ qualifier was also already present to allow forcing a function call on sm_1.x.
- New -dlcm compilation flag to specify the global memory caching strategy on Fermi (G.4.2).
Interesting new stuff in the Fermi Compatibility Guide:
- Just-in-time kernel compilation can be used with the runtime API with R195 drivers (Section 1.2.1).
- Details using the volatile keyword for intra-warp communications (Section 1.2.2).
Interesting new stuff in the Best Practice Guide:
- Use signed integers instead of unsigned as loop counters. This allows the compiler to perform strength reduction and can provide better performance (Section 6.3).
Jeremy Sugerman defended his PhD yesterday, the slides from his talk on GRAMPS can be found there:
Programming Many-Core Systems with GRAMPS
Here is also a talk he did on GRAMPS at UCDavis in February :
Some quite old stuff (from January) that I just read again. I like this point of view :-)
Here are 3 good places to find bit manipulation hacks and efficient low level algorithms for various mathematical functions:
Bit Twiddling Hacks (Stanford)
The Aggregate Magic Algorithms (University of Kentucky)
If you have other ones like this, do not hesitate to post them in the comments !
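As a taste of what these pages contain, here is the classic branch-free "round up to the next power of two" hack from the Stanford page:

```cpp
#include <cstdint>

// Rounds v up to the next power of two (returns v if it already is one):
// decrement, smear the highest set bit into all lower positions with a
// cascade of shifted ORs, then add one.
uint32_t nextPow2(uint32_t v) {
    v--;
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    return v + 1;
}
```

No branches, no loops: exactly the kind of code that also maps well to GPU threads.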
I just came back from 3 months in New Zealand, working with Weta Digital. It was great, a very nice and interesting experience !
A crazy paper from university of Toronto:
Demystifying GPU Microarchitecture through Microbenchmarking
This work develops a microbenchmark suite and measures the CUDA-visible architectural characteristics of the NVIDIA GT200 (GTX 280) GPU. Various undisclosed characteristics of the processing elements and the memory hierarchies are measured.
CUDA is awesome, and for me, one of the reasons I think it is better than OpenCL is its support of C++ templates.
I have been using templates in CUDA for quite a long time now, and in addition to the classical "generic programming" advantages (generic types, functors...), using templates allows for a lot of optimizations in kernel functions.
First, templated values ( template<uint i>... ) can be used as compile-time constants. For instance, blockDim is very often known and fixed at compile time. Passing it through a templated value instead of relying on the built-in variable allows faster access, since its value is directly integrated as a constant into the asm. The compiler can also optimize some operations: if the constant is a power of two, for instance, multiplications and divisions will be transformed into bit shifts.
Even more interestingly, you can help the compiler in many cases where it would not optimize by itself, by implementing the optimizations yourself using template evaluation. Such usage of templates is called template metaprogramming. C++ templates are Turing-complete, which means you can implement any computation you want so that it will be evaluated at compile time by the template processor.
For instance, I am not sure the compiler will detect when you are passing a constant to a function like log2(). But you can implement the compile time (recursive) evaluation of log2 very easily with templates:
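A minimal version of such a compile-time log2 (for power-of-two inputs) could look like this:

```cpp
// Compile-time recursive evaluation of log2: Log2<N>::value is computed
// entirely by the template processor, so it can be used anywhere a
// compile-time constant is required (shift amounts, array sizes, ...).
template <unsigned int N>
struct Log2 {
    static const unsigned int value = 1 + Log2<N / 2>::value;
};

template <>  // the recursion stops at 1
struct Log2<1> {
    static const unsigned int value = 0;
};
```

For instance, `Log2<BlockSize>::value` costs nothing at runtime, where `log2()` on the same constant might not be folded by the compiler.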
Template Metaprogramming libraries exist and provide a lot of very advanced and powerful features. I am personally using Loki that is the library written by Andrei Alexandrescu as part of his (awesome) book Modern C++ Design. I am mainly using Typelist and Type manipulation features and they compile perfectly with CUDA 2.3.
Some impressive pictures of the Eyjafjallajokull volcano currently freezing European air traffic:
Even if it is so beautiful, it would be nice if this pretty little volcano stopped its teenage angst so that I can come back home next week !
... and unions in local variables, they eat kittens !
PS: And in many situations, they also spill into local memory. So if you are writing a ray tracer, do not use a Ray structure !
Insomniac Games' Engine Director Mike Acton @ GDC 2010
(Lie #1) Software is a platform
"The reality is software is not a platform. You can't idealize the hardware. And the constants in the "Big-O notation" that are so often ignored, are often the parts that actually matter in reality(...) You can't judge code in a vacuum. Hardware impacts data design."
(Lie #2) Code should be designed around a model of the world
"There is no value in code being some kind of model or map of an imaginary world (...) it is extremely popular. If there's a rocket in the game, rest assured that there is a "Rocket" class (...) which contains data for exactly one rocket and does rockety stuff (...) There are a lot of performance penalties for this kind of design, the most significant one is that it doesn't scale. At all. One hundred rockets costs one hundred times as much as one rocket. And it's extremely likely it costs even more than that !"
(Lie #3) Code is more important than data
"Code is ephemeral and has no real intrinsic value. The algorithms certainly do, sure. But the code itself isn't worth all this time (...). The code, the performance and the features hinge on one thing - the data."
A very useful trick found on the CUDA forum.
More info there:
After the Mandelbulb (can also see my implementation here), here is the Mandelbox found on fractalforums.com !
A First Look at the Larrabee New Instructions (LRBni), by Michael Abrash
Not that I am passionate about Larrabee -and I don't really believe in an x86 GPU-, but it is still interesting to see what choices have been made for the ISA. After the announced cancellation of the first Larrabee chip as a GPU, I have heard rumors saying that it could still be proposed for HPC (and it seems that the Larrabee GPU could be rescheduled in a few years).
- New Texture, Sampler and Surface types: opaque types for manipulating texture, sampler and surface descriptors as normal variables. -> More flexible texture manipulation, allowing arrays of textures for instance.
- New syntax for abstracting an underlying ABI (Application Binary Interface): defines a syntax for function definitions/calls, parameter passing, variadic functions, and dynamic memory allocation on the stack ("alloca"). -> True function calls, and recursion ! But not yet implemented in CUDA 3.0.
- New binary instructions: popc (population count, number of one bits), clz (count leading zeros), bfind (most significant non-sign bit), brev (bit reverse), bfe/bfi (bit field extract/insert), prmt (permute)
- Cache operators: allow selecting (per operation) the level of caching in the cache hierarchy (L1/L2) for load/store instructions.
- Prefetch instructions (Table 84) that allow forcing the load of a page in global/local memory into a specific cache level.
- Surface load/store (suld/sust, Tables 90/91): read/write (through the ROPs ?) into render targets. (Supports 3D R/W ! Hum.. really working ?)
- Video instructions: Vector operations on bytes/half-words/words.
- Performance tuning directives (10.3): allow helping the compiler optimize the code based on block configurations.
Interesting analysis of the latest ARM architectures and a comparison with x86:
Anandtech published interesting insights into the strategy of ATI/AMD for its last two successful architectures, through interviews with Carrell Killebrew, engineering lead on the RV770.
The RV770 Story: Documenting ATI's Road to Success
The RV870 Story: AMD Showing up to the Fight
- Fermi/DX11 level shader extensions (NV_gpu_program5, NV_tessellation_program5, NV_gpu_shader5)
- Global memory Load/Store from shaders ! (NV_shader_buffer_load, NV_shader_buffer_store, build upon "Bindless Graphics")
- Texture Load/Store ! (EXT_shader_image_load_store)
It simulates an incompressible fluid (Navier-Stokes differential equations) through the well-known velocity advection (Jos Stam's method). The simulation in the video runs in a 200x200x200 voxel grid.
One big pain with CUDA under Windows Vista or Seven is that performance suffers a lot from the limits and overheads imposed by the WDDM (Windows Display Driver Model) the driver has to comply with.
This means slower kernel launches, a limit on the size of memory allocations, and a lot of constraints that prevent NVIDIA from efficiently implementing a lot of features in CUDA.
Tim Murray on the CUDA forum:
"Welcome to WDDM. Kernel launch overhead is ~3us on non-WDDM platforms. On WDDM, it's 40 at a minimum and can potentially be much larger. Considering the number of kernels you're launching in 10ms, that's going to add up."
"WDDM is a lot more than just a rendering interface. It manages all the memory on the device so it can page it in and out as necessary, which is a good thing for display cards. However, we get zero benefit from it in CUDA, because we have pointers! As a result, you can't really do paging in a CUDA app, so you get zero benefit from WDDM. However, because it's the memory manager, we can't just go around it for CUDA because WDDM will assume it owns the card completely, start moving memory, and whoops your CUDA app just exploded. So no, there's not really some magic workaround for cards that can also be used as display."
To overcome this problem, NVIDIA provides compute-only drivers for Tesla boards. But with a little effort they can also be installed on a GeForce.
How to install them on GeForce:
After many years using my icare3d personal website as a kind of blog, I finally turned to Blogger. Publishing things on my site involved too much formatting time, and I became lazy about posting there. The intent of this new blog is to publish my thoughts and findings about GPUs, parallel programming, and computer graphics more regularly.
The first posts are likely to be bunches of old stuff I did not take time to post earlier.
Hope you will enjoy it !