A very useful trick found on the CUDA forum.

Very often, the CUDA compiler inlines the operations needed to compute a variable's value at every place the variable is used, instead of keeping the value in a register. This can be a good strategy in some situations, but there are also many cases where it needlessly increases register usage and duplicates instructions. To prevent this, the "volatile" keyword can be used when the variable is declared, forcing the value to be computed once, kept, and reused.
This trick also works with constant variables (and shared memory), which would otherwise get loaded into registers over and over when accessed at several places.
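
As a rough illustration of the trick described above, here is a minimal sketch. The names kGain, reuseKernel and runReuse are made up for this example and do not come from the forum threads. Whether "volatile" actually changes the register count depends on the compiler version, so it is worth comparing the resource report from nvcc -Xptxas -v with and without the keyword.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical constant-memory parameter, for illustration only.
__constant__ float kGain;

__global__ void reuseKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Load the constant once and keep it, instead of reloading it at each use.
    volatile float gain = kGain;
    // Compute the shared sub-expression once instead of having it re-inlined.
    volatile float x = in[i] * in[i] + 1.0f;

    out[2 * i]     = x + gain;  // first use of x and gain
    out[2 * i + 1] = x * gain;  // second use of x and gain
}

// Hypothetical host helper: runs the kernel and returns the interleaved
// results, or an empty vector if a CUDA call fails.
std::vector<float> runReuse(const std::vector<float>& in, float gain)
{
    int n = static_cast<int>(in.size());
    std::vector<float> out(2 * n);
    if (n == 0) return out;

    float* dIn = nullptr;
    float* dOut = nullptr;
    if (cudaMalloc(&dIn, n * sizeof(float)) != cudaSuccess) return {};
    if (cudaMalloc(&dOut, 2 * n * sizeof(float)) != cudaSuccess) {
        cudaFree(dIn);
        return {};
    }

    cudaMemcpy(dIn, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(kGain, &gain, sizeof(float));

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    reuseKernel<<<blocks, threads>>>(dIn, dOut, n);

    // This copy also waits for the kernel and reports any pending error.
    cudaError_t err = cudaMemcpy(out.data(), dOut, 2 * n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return {};
    return out;
}
```

Each volatile variable is read only once per statement here, which keeps the example on the safe side of the C++ rules for volatile accesses.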

It clearly reduces the number of virtual registers allocated at the PTX level, which helps a lot during the real register allocation phase that happens later, when the PTX is translated to cubin. However, be careful not to use it on arrays that are indexed with constants, for instance: they would be put in local memory.
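
To make the caveat about arrays concrete, here is a small sketch, again with made-up names (weightsKernel, runWeights). It applies "volatile" to the reused scalar only and leaves the constant-indexed array plain. The comment about local memory restates the warning above rather than a measured result, so check the resource report from nvcc -Xptxas -v on your own compiler.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void weightsKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Plain array accessed only with constant indices: the compiler is free
    // to keep each element in a register. Declaring it as
    // `volatile float w[3]` is the case warned about above, where the array
    // would be put in local memory.
    float w[3] = {0.25f, 0.5f, 0.25f};

    // Scalar reused several times: this is where volatile is meant to help.
    volatile float x = in[i];

    out[3 * i]     = w[0] * x;
    out[3 * i + 1] = w[1] * x;
    out[3 * i + 2] = w[2] * x;
}

// Hypothetical host helper: returns the results, or an empty vector if a
// CUDA call fails.
std::vector<float> runWeights(const std::vector<float>& in)
{
    int n = static_cast<int>(in.size());
    std::vector<float> out(3 * n);
    if (n == 0) return out;

    float* dIn = nullptr;
    float* dOut = nullptr;
    if (cudaMalloc(&dIn, n * sizeof(float)) != cudaSuccess) return {};
    if (cudaMalloc(&dOut, 3 * n * sizeof(float)) != cudaSuccess) {
        cudaFree(dIn);
        return {};
    }

    cudaMemcpy(dIn, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    weightsKernel<<<(n + 127) / 128, 128>>>(dIn, dOut, n);

    // This copy also waits for the kernel and reports any pending error.
    cudaError_t err = cudaMemcpy(out.data(), dOut, 3 * n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return {};
    return out;
}
```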

More info here:
http://forums.nvidia.com/index.php?showtopic=89573
http://forums.nvidia.com/index.php?showtopic=99209