Bit hacks and low level algorithms

Here are 3 good places to find bit manipulation hacks and efficient low level algorithms for various mathematical functions:

Bit Twiddling Hacks (Stanford)
The Aggregate Magic Algorithms (University of Kentucky)
HAKMEM (MIT)

If you know of other resources like these, do not hesitate to post them in the comments!

Weta Digital Experience



I just came back from 3 months in New Zealand, working with Weta Digital. It was great, a very nice and interesting experience!

NVIDIA GT200 microbenchmarking

A crazy paper from the University of Toronto:
Demystifying GPU Microarchitecture through Microbenchmarking

This work develops a microbenchmark suite and measures the CUDA-visible architectural characteristics of the NVIDIA GT200 (GTX 280) GPU. Various undisclosed characteristics of the processing elements and the memory hierarchies are measured.

CUDA Template Metaprogramming

CUDA is awesome and, for me, one of the reasons I think it is better than OpenCL is its support for C++ templates.

I have been using templates in CUDA for quite a long time now, and in addition to the classical "generic programming" advantages (generic types, functors...), using templates allows for a lot of optimizations in kernel functions.

First, templated values ( template<uint i>... ) can be used as compile-time constants. For instance, blockDim is very often known and fixed at compile time. Passing it through a templated value instead of relying on the built-in variable allows faster access, since its value is directly integrated as a constant in the asm. The compiler can also optimize some operations: if the constant is a power of two, for instance, multiplications and divisions will be transformed into bit-shifts.
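For illustration, here is a minimal sketch (the kernel and its names are hypothetical, not from any real code base) of passing the block size as a template parameter instead of reading blockDim:

    // Hypothetical kernel: the block size is a template parameter, so the
    // compiler treats it as a compile-time constant instead of reading blockDim.
    template<unsigned int BLOCK_SIZE>
    __global__ void scaleKernel(float *data, float factor, unsigned int n)
    {
        // BLOCK_SIZE is a constant here: since 256 is a power of two, the
        // multiplication below can be compiled into a shift.
        unsigned int i = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    // Host side: the template argument must match the launch configuration.
    // scaleKernel<256><<<numBlocks, 256>>>(d_data, 2.0f, n);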

Even more interestingly, you can help the compiler in many cases where it would not optimize by itself, by implementing the optimizations yourself using template evaluation. Such usage of templates is called template metaprogramming. C++ templates are Turing-complete, which means you can implement any computation you want so that it will be evaluated at compile time by the template processor.
For instance, I am not sure the compiler will detect when you are passing a constant to a function like log2(). But you can implement the compile-time (recursive) evaluation of log2 very easily with templates:
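A possible version of such a compile-time log2 (a sketch; the struct name is mine):

    // Recursive compile-time evaluation of log2 through template instantiation.
    template<unsigned int N>
    struct StaticLog2
    {
        static const unsigned int value = 1 + StaticLog2<N / 2>::value;
    };

    // Recursion terminator: log2(1) == 0.
    template<>
    struct StaticLog2<1>
    {
        static const unsigned int value = 0;
    };

    // Evaluated entirely by the compiler, usable anywhere a constant is expected:
    // unsigned int shift = StaticLog2<256>::value;   // == 8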



Template metaprogramming libraries exist and provide a lot of very advanced and powerful features. I am personally using Loki, the library written by Andrei Alexandrescu as part of his (awesome) book Modern C++ Design. I mainly use the Typelist and type manipulation features, and they compile perfectly with CUDA 2.3.

Nature's renderer is awesome


Some impressive pictures of the Eyjafjallajokull volcano currently paralyzing European air traffic:
http://www.boston.com/bigpicture/2010/04/more_from_eyjafjallajokull.html

Even if it is so beautiful, it would be nice if this pretty little volcano stopped its teenage angst so that I can come back home next week!

CUDA: Beware of the structs...

... and unions in local variables, they eat kittens!

PS: And in many situations, they also end up spilled to local memory. So if you are writing a ray tracer, do not use a Ray structure!
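To illustrate the problem, here is a hypothetical sketch (the kernel and the Ray struct are mine, not taken from any real tracer) of a struct held in a local variable that the compiler may spill:

    struct Ray
    {
        float3 origin;
        float3 direction;
    };

    __global__ void traceKernel(const float3 *origins, const float3 *directions,
                                float *out, unsigned int n)
    {
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        // This local struct may be placed in (slow, off-chip) local memory
        // instead of registers; keeping the six components in separate float
        // variables gives the compiler a better chance to use registers only.
        Ray ray;
        ray.origin    = origins[i];
        ray.direction = directions[i];

        out[i] = ray.origin.x * ray.direction.x
               + ray.origin.y * ray.direction.y
               + ray.origin.z * ray.direction.z;
    }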

Three big lies (of Software Development)

Insomniac Games Engine Director Mike Acton @ GDC 2010

(Lie #1) Software is a platform

"The reality is software is not a platform. You can't idealize the hardware. And the constants in the "Big-O notation" that are so often ignored, are often the parts that actually matter in reality(...) You can't judge code in a vacuum. Hardware impacts data design."

(Lie #2) Code should be designed around a model of the world

"There is no value in code being some kind of model or map of an imaginary world (...) it is extremely popular. If there's a rocket in the game, rest assured that there is a "Rocket" class (...) which contains data for exactly one rocket and does rockety stuff (...) There are a lot of performance penalties for this kind of design, the most significant one is that it doesn't scale. At all. One hundred rockets costs one hundred times as much as one rocket. And it's extremely likely it costs even more than that!"

(Lie #3) Code is more important than data

"Code is ephemeral and has no real intrinsic value. The algorithms certainly do, sure. But the code itself isn't worth all this time (...). The code, the performance and the features hinge on one thing - the data."

http://cellperformance.beyond3d.com/articles/2008/03/three-big-lies.html
http://www.insomniacgames.com/assets/filesthreebiglies2010.pdf

CUDA "volatile trick"

A very useful trick found on the CUDA forum.

Very often, the CUDA compiler inlines the operations needed to compute the value of a variable used in several places, instead of keeping the variable in a register. This can be a good strategy in some situations, but there are also many cases where it brings register usage up unnecessarily and duplicates instructions. To prevent this, the "volatile" keyword can be used when the variable is declared, forcing its value to be really kept and reused.
This trick also works with constant variables (and shared memory), which would otherwise get loaded into registers over and over when accessed in several places.

It clearly reduces the number of virtual registers allocated at the PTX level, which helps a lot for the real register allocation phase that happens later during the translation to cubin. However, be careful not to use it with constantly indexed arrays, for instance: they would be put in local memory.
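A minimal sketch of the trick (hypothetical kernel; the details are assumptions based on the forum threads below):

    __global__ void volatileTrickKernel(const float *in, float *out, unsigned int stride)
    {
        __shared__ float scale;
        if (threadIdx.x == 0)
            scale = in[0];
        __syncthreads();

        // Without 'volatile', the compiler may re-inline this index computation
        // (and re-load 'scale' from shared memory) at every use; 'volatile'
        // encourages it to compute/load the value once and keep it in a register.
        volatile unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        volatile float s = scale;

        out[i]          = in[i] * s;
        out[i + stride] = in[i + stride] * s;
    }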

More info here:
http://forums.nvidia.com/index.php?showtopic=89573
http://forums.nvidia.com/index.php?showtopic=99209

Mandelbox


After the Mandelbulb (you can also see my implementation here), here is the Mandelbox, found on fractalforums.com!
http://sites.google.com/site/mandelbox/
http://www.fractalforums.com/3d-fractal-generation/amazing-fractal/

Larrabee New Instructions @drdobbs

A First Look at the Larrabee New Instructions (LRBni), by Michael Abrash


http://www.drdobbs.com/high-performance-computing/216402188

Not that I am passionate about Larrabee (and I don't really believe in an x86 GPU), but it is still interesting to see what choices have been made for the ISA. After the announced cancellation of the first Larrabee chip as a GPU, I have heard rumors saying that it could still be offered for HPC (and it seems that the Larrabee GPU could be rescheduled in a few years).

CUDA PTX 2.0 (Fermi) specification released

NVIDIA has made the specification of the PTX 2.0 ISA for Fermi available; it can be downloaded there:

Among the interesting things I saw:
  • New texture, sampler and surface types: opaque types for manipulating texture, sampler and surface descriptors as normal variables. -> More flexible texture manipulation, allowing arrays of textures for instance.
  • New syntax abstracting an underlying ABI (Application Binary Interface): defines a syntax for function definitions/calls, parameter passing, variadic functions, and dynamic memory allocation on the stack ("alloca"). -> True function calls and recursion! But not yet implemented in CUDA 3.0.
  • New bit-manipulation instructions: popc (population count, number of one bits), clz (count leading zeros), bfind (most significant non-sign bit), brev (bit reverse), bfe/bfi (bit field extract/insert), prmt (permute). See the sketch after this list.
  • Cache operators (8.7.5.1): allow selecting, per operation, the caching level in the cache hierarchy (L1/L2) used by load/store instructions.
  • Prefetch instructions (Table 84): allow forcing the load of data from global/local memory into a specific cache level.
  • Surface load/store (suld/sust, Tables 90/91): read/write (through the ROPs?) into render targets. (Supports 3D R/W! Hmm... really working?)
  • Video instructions: vector operations on bytes/half-words/words.
  • Performance tuning directives (10.3): allow helping the compiler optimize the code based on block configurations.
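As a quick illustration of the new bit instructions, here is a hypothetical kernel using the CUDA C intrinsics that should map to some of them (popc, clz, brev, prmt); this is just a sketch:

    __global__ void bitInstructionsKernel(const unsigned int *in, unsigned int *out)
    {
        unsigned int x = in[threadIdx.x];

        unsigned int bits     = __popc(x);                  // population count -> popc
        unsigned int leading  = __clz(x);                   // count leading zeros -> clz
        unsigned int reversed = __brev(x);                  // bit reverse -> brev
        unsigned int permuted = __byte_perm(x, 0, 0x0123);  // byte permutation -> prmt

        out[threadIdx.x] = bits + leading + reversed + permuted;
    }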

ARM CPUs analysis @bsn

Interesting analysis of ARM's latest architectures and a comparison with x86:

http://www.brightsideofnews.com/news/2010/4/7/the-coming-war-arm-versus-x86.aspx

AMD/ATI Stories @anandtech

AnandTech published interesting insights into ATI/AMD's strategy for its last two successful architectures, through interviews with Carrell Killebrew, engineering lead on the RV770.

The RV770 Story: Documenting ATI's Road to Success
The RV870 Story: AMD Showing up to the Fight

NVIDIA OpenGL 4.0 driver + extensions

http://news.developer.nvidia.com/2010/04/nvidia-releases-opengl-4-drivers-plus-8-new-extensions.html

Quickly:

  • Fermi/DX11 level shader extensions (NV_gpu_program5, NV_tessellation_program5, NV_gpu_shader5)
  • Global memory load/store from shaders! (NV_shader_buffer_load, NV_shader_buffer_store, built upon "Bindless Graphics")
  • Texture Load/Store ! (EXT_shader_image_load_store)
OpenGL is moving forward, yeah \o/

Jan Vlietinck's DirectCompute fluid simulation




http://users.skynet.be/fquake/

It simulates an incompressible fluid (Navier-Stokes differential equations) through the well-known velocity advection scheme (Jos Stam's method). The simulation in the video runs on a 200x200x200 voxel grid.
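For reference, here is a very rough sketch (not Jan's code; the grid size, names and nearest-neighbor sampling are my own simplifications) of the semi-Lagrangian velocity advection step from Jos Stam's method:

    #define GRID 200  // voxel grid resolution, as in the demo

    __global__ void advectVelocity(const float3 *velIn, float3 *velOut, float dt)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= GRID || y >= GRID) return;

        for (int z = 0; z < GRID; ++z)
        {
            int idx = (z * GRID + y) * GRID + x;
            float3 v = velIn[idx];

            // Trace the cell center backwards along the velocity field...
            int sx = min(max(__float2int_rn(x - dt * v.x), 0), GRID - 1);
            int sy = min(max(__float2int_rn(y - dt * v.y), 0), GRID - 1);
            int sz = min(max(__float2int_rn(z - dt * v.z), 0), GRID - 1);

            // ...and fetch the previous velocity there (nearest neighbor here;
            // a real implementation would interpolate, typically via a 3D texture).
            velOut[idx] = velIn[(sz * GRID + sy) * GRID + sx];
        }
    }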

Nice work Jan :-)

Overcoming WDDM : Tesla compute-only driver on GeForce

One big pain with CUDA under Windows Vista or Seven is that performance suffers a lot from the limits and overheads imposed by the WDDM (Windows Display Driver Model) the driver has to comply with.
This means slower kernel launches, limits on the size of memory allocations, and a lot of constraints that prevent NVIDIA from efficiently implementing many features in CUDA.

Tim Murray on the CUDA forum:

"Welcome to WDDM. Kernel launch overhead is ~3us on non-WDDM platforms. On WDDM, it's 40 at a minimum and can potentially be much larger. Considering the number of kernels you're launching in 10ms, that's going to add up."
"WDDM is a lot more than just a rendering interface. It manages all the memory on the device so it can page it in and out as necessary, which is a good thing for display cards. However, we get zero benefit from it in CUDA, because we have pointers! As a result, you can't really do paging in a CUDA app, so you get zero benefit from WDDM. However, because it's the memory manager, we can't just go around it for CUDA because WDDM will assume it owns the card completely, start moving memory, and whoops your CUDA app just exploded. So no, there's not really some magic workaround for cards that can also be used as display."

To overcome this problem, NVIDIA provides a compute-only driver for Tesla boards. But with a little effort it can also be installed on a GeForce.
How to install it on a GeForce:
http://oscarbg.blogspot.com/2010/02/about-tesla-computing-driver.html

New blog

After many years using my icare3d personal website as a kind of blog, I have finally turned to Blogger. Publishing things on my site involved too much formatting time, and I became lazy about posting there. The intent of this new blog is to publish my thoughts and findings about GPUs, parallel programming, and computer graphics more regularly.
The first posts are likely to be batches of old stuff I did not take the time to post earlier.
Hope you will enjoy it!

Copyright © Icare3D Blog