A guy from Los Alamos compared the performance of output queues on Tesla 2 and Fermi, where each queue is filled using an atomic add on an integer index. First result: 16x speedup on Fermi!
http://forums.nvidia.com/index.php?showtopic=170125

This is supposedly thanks to the coalescing of atomic operations, which may be done in the L2 cache.
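
For readers who have not seen the pattern, here is a minimal sketch of an output queue driven by an atomic add on an integer index. It is not the code from the forum thread: the `keep` predicate, the kernel name `enqueue`, the host helper `run_queue` and the launch configuration are all made up for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical filter (an assumption for this sketch): keep even values only.
__device__ bool keep(int v) { return (v & 1) == 0; }

// Each thread that has something to emit reserves a slot by atomically
// incrementing the queue's tail index, then writes its item into that slot.
__global__ void enqueue(const int* in, int n, int* queue, int* tail)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && keep(in[i])) {
        int slot = atomicAdd(tail, 1);  // returns the value of *tail before the add
        queue[slot] = in[i];
    }
}

// Host helper (hypothetical): feeds 0..n-1 through the kernel and
// returns how many items ended up in the queue.
int run_queue(int n)
{
    std::vector<int> h_in(n);
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_queue = nullptr, *d_tail = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_queue, n * sizeof(int));  // worst case: every item is kept
    cudaMalloc(&d_tail, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_tail, 0, sizeof(int));     // queue starts empty

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    enqueue<<<blocks, threads>>>(d_in, n, d_queue, d_tail);

    int count = 0;
    cudaMemcpy(&count, d_tail, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_queue);
    cudaFree(d_tail);
    return count;
}

int main()
{
    int n = 1 << 20;
    printf("queued %d of %d items\n", run_queue(n), n);
    return 0;
}
```

Every kept item costs one atomic on the same address, so the throughput of this kernel is mostly the throughput of the atomic unit, which is what makes it a good probe for the difference between the two architectures. Note that the order of items in the queue is not deterministic.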

He also did another experiment to see if the L2 cache allows combining writes from different blocks into global memory, and it appears to be the case when you have consecutive blocks writing to the same cache line at the same time. Result: 3.25x speedup on Fermi.
http://forums.nvidia.com/index.php?showtopic=170127
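
A rough sketch of the kind of access pattern this second experiment is about, not the actual benchmark from the thread: each block writes only a small chunk, so several consecutive blocks land in the same cache line. The chunk size of 8 ints (32 bytes) and the 128-byte line size are assumptions made for illustration, and the names `small_block_writes` and `run_small_writes` are hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumption for this sketch: each block writes 8 ints (32 bytes), so four
// consecutive blocks cover one 128-byte line.
const int INTS_PER_BLOCK = 8;

// Launch with blockDim.x == INTS_PER_BLOCK. Block b writes the chunk
// [b * INTS_PER_BLOCK, (b + 1) * INTS_PER_BLOCK) and stores its own index there.
__global__ void small_block_writes(int* out)
{
    int i = blockIdx.x * INTS_PER_BLOCK + threadIdx.x;
    out[i] = blockIdx.x;
}

// Host helper (hypothetical): runs the kernel and checks that every element
// holds the index of the block that was supposed to write it.
bool run_small_writes(int num_blocks)
{
    int n = num_blocks * INTS_PER_BLOCK;
    int* d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));

    small_block_writes<<<num_blocks, INTS_PER_BLOCK>>>(d_out);

    std::vector<int> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    for (int i = 0; i < n; ++i)
        if (h_out[i] != i / INTS_PER_BLOCK) return false;
    return true;
}

int main()
{
    printf("writes %s\n", run_small_writes(4096) ? "ok" : "wrong");
    return 0;
}
```

To turn this into a measurement you would time the kernel (for example with CUDA events) on both architectures. Whether neighbouring blocks really hit the same line at the same time depends on how the hardware schedules blocks, which the code cannot control.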