A friend pointed me to this very interesting talk from NVIDIA GTC:
Better Performance at Lower Occupancy

The talk debunks two fallacies that CUDA developers commonly believe:

  • Multithreading is the only way to hide latency on GPU
  • Shared memory is as fast as registers
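
To make the first point concrete, here is a minimal sketch of my own (not code from the talk). Both kernels compute the same SAXPY. The first relies only on having many threads in flight. The second gives each thread four independent elements held in registers, so several loads are in flight per thread and the kernel is launched with a quarter of the threads. The names saxpy_tlp, saxpy_ilp, ELEMS and max_abs_error are hypothetical, and whether the second version is actually faster depends on the GPU, so it has to be measured.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Baseline: one element per thread. Memory latency is hidden only by
// the scheduler switching between many resident threads.
__global__ void saxpy_tlp(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Assumption for this sketch: 4 independent elements per thread.
constexpr int ELEMS = 4;

// ILP variant: each thread issues ELEMS independent loads before using
// any of them, so their latencies can overlap within a single thread.
// With full unrolling, xv and yv are normally kept in registers.
__global__ void saxpy_ilp(int n, float a, const float* x, float* y) {
    int base = blockIdx.x * blockDim.x * ELEMS + threadIdx.x;
    float xv[ELEMS], yv[ELEMS];
    #pragma unroll
    for (int k = 0; k < ELEMS; ++k) {
        int i = base + k * blockDim.x;   // stride keeps accesses coalesced
        if (i < n) { xv[k] = x[i]; yv[k] = y[i]; }
    }
    #pragma unroll
    for (int k = 0; k < ELEMS; ++k) {
        int i = base + k * blockDim.x;
        if (i < n) y[i] = a * xv[k] + yv[k];
    }
}

// Host helper: runs one of the kernels and returns the largest absolute
// difference from the CPU reference. Error checking is omitted for brevity.
float max_abs_error(int n, bool use_ilp) {
    std::vector<float> hx(n), hy(n);
    for (int i = 0; i < n; ++i) { hx[i] = float(i % 100); hy[i] = 1.0f; }

    size_t bytes = size_t(n) * sizeof(float);
    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    if (use_ilp) {
        // Four times fewer threads than the baseline launch.
        int blocks = (n + threads * ELEMS - 1) / (threads * ELEMS);
        saxpy_ilp<<<blocks, threads>>>(n, 2.0f, dx, dy);
    } else {
        int blocks = (n + threads - 1) / threads;
        saxpy_tlp<<<blocks, threads>>>(n, 2.0f, dx, dy);
    }
    cudaMemcpy(hy.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);

    float err = 0.0f;
    for (int i = 0; i < n; ++i)
        err = std::fmax(err, std::fabs(hy[i] - (2.0f * hx[i] + 1.0f)));
    return err;
}
```

Calling max_abs_error(1 << 20, false) and max_abs_error(1 << 20, true) from a main function checks that both variants agree with the CPU result. To compare their speed, time the two kernel launches with cudaEvent timers or a profiler.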

All the GTC 2010 presentations can be found here (with slides and videos!):
http://www.nvidia.com/object/gtc2010-presentation-archive.html