CUDA is awesome and, for me, one of the reasons it is better than OpenCL is its support for C++ templates.

I have been using templates in CUDA for quite a long time now, and beyond the classical generic-programming advantages (generic types, functors...), templates also enable a lot of optimizations in kernel functions.

First, template value parameters ( template<uint i>... ) can be used as compile-time constants. For instance, blockDim is very often known and fixed at compile time. Passing it as a template parameter instead of relying on the built-in variable allows faster access, since its value is embedded directly as a constant in the generated assembly. It also lets the compiler optimize certain operations: if the constant is a power of two, for instance, multiplications and divisions are transformed into bit shifts.
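As a minimal sketch of this idea (written as plain C++ so it compiles anywhere; in a real kernel the function would be `__global__` and the indices would come from blockIdx/threadIdx — the name `globalIndex` is illustrative, not from the original post):

```cpp
// Because BLOCK_SIZE is a template parameter, it is a compile-time constant
// at every use site. For a power-of-two value like 256, the compiler can
// turn BLOCK_SIZE * blockIdx into (blockIdx << 8) instead of a runtime
// multiply by a variable read from special registers.
template<unsigned BLOCK_SIZE>
unsigned globalIndex(unsigned blockIdx, unsigned threadIdx) {
    return BLOCK_SIZE * blockIdx + threadIdx;
}
```

Each distinct value of BLOCK_SIZE instantiates its own specialized function, so the host code typically dispatches to the right instantiation once, before launch.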

Even more interestingly, you can help the compiler in many cases where it would not optimize by itself, by implementing the optimizations yourself using template evaluation. This usage of templates is called template metaprogramming. C++ templates are Turing-complete, which means you can express any computation you want so that it is evaluated at compile time by the template processor.
For instance, I am not sure the compiler will detect when you pass a constant to a function like log2(). But you can implement a compile-time (recursive) evaluation of log2 very easily with templates:
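A minimal version of such a compile-time log2 might look like this (the struct name is illustrative):

```cpp
// Recursive case: log2(N) = 1 + log2(N / 2), unrolled by the template
// processor at compile time, one instantiation per step.
template<unsigned N>
struct StaticLog2 {
    static const unsigned value = 1 + StaticLog2<N / 2>::value;
};

// Full specialization for N == 1 stops the recursion.
template<>
struct StaticLog2<1> {
    static const unsigned value = 0;
};
```

Since StaticLog2<256>::value is itself a compile-time constant, it can be used anywhere a constant expression is required: as an array size, a shift amount, or another template argument.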

Template metaprogramming libraries exist and provide a lot of very advanced and powerful features. I am personally using Loki, the library written by Andrei Alexandrescu as part of his (awesome) book Modern C++ Design. I mainly use its Typelist and type-manipulation features, and they compile perfectly with CUDA 2.3.