Here it is: CUDA 4.0 RC has just been released to NVIDIA registered developers.

Interesting stuff from the CUDA manual:

  • Layered Textures Support (GL_TEXTURE_1D/2D_ARRAY): New tex.a1d/.a2d modifiers in PTX. Unfortunately, the surface instructions do not support them yet, Grrrr
    Layered textures are created using cudaMalloc3DArray() with the cudaArrayLayered flag. There are new cudaTextureType1DLayered/cudaTextureType2DLayered texture sampler types and tex1DLayered()/tex2DLayered() access intrinsics.
  • New .address_size PTX specifier: Lets you specify the address size (32-bit/64-bit) used throughout a PTX module.
  • Inline PTX assembly: This feature has been present since CUDA 2.x but was not officially supported. It's now fully supported and documented :-D
  • Driver API: New thread-safe, stateless launch function cuLaunchKernel(), for example: cuLaunchKernel(kernelObj, blocksPerGrid, 1, 1, threadsPerBlock, 1, 1, 0, 0, args, 0);
  • FERMI ISA documented and supported by cuobjdump.
  • Enhanced C++: Support for operators new and delete, and for virtual functions.
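
To give a feel for the inline PTX item in the list above, here is a minimal sketch that reads the %laneid special register from device code. The names lane_id, write_lane_ids and count_lane_id_mismatches are made up for this illustration, and the check assumes a warp size of 32 and a 1D thread block, so treat it as a sketch rather than a reference implementation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: reads the PTX special register %laneid with inline assembly.
// "%%" escapes the percent sign; "=r" binds a 32-bit register output operand.
__device__ unsigned int lane_id()
{
    unsigned int lane;
    asm("mov.u32 %0, %%laneid;" : "=r"(lane));
    return lane;
}

// Each thread stores its lane index at its own position in the output array.
__global__ void write_lane_ids(unsigned int *out)
{
    out[threadIdx.x] = lane_id();
}

// Launches one 1D block of 64 threads (two warps) and compares each result
// with threadIdx.x % 32 (assumption: warp size is 32).
// Returns the number of mismatches, or -1 if a CUDA call fails.
int count_lane_id_mismatches()
{
    const int n = 64;
    unsigned int host[n];
    unsigned int *dev = 0;

    if (cudaMalloc((void **)&dev, n * sizeof(unsigned int)) != cudaSuccess)
        return -1;

    write_lane_ids<<<1, n>>>(dev);

    // cudaMemcpy waits for the kernel and also reports any launch error.
    cudaError_t err = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess)
        return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != (unsigned int)(i % 32))
            ++mismatches;
    return mismatches;
}

int main()
{
    int mismatches = count_lane_id_mismatches();
    if (mismatches < 0) {
        printf("CUDA error\n");
        return 1;
    }
    printf("lane id mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

Building it should only need something like nvcc on the single .cu file; the same asm() syntax works for other PTX instructions, with "r", "l", "f" and "d" constraints for 32-bit, 64-bit, float and double operands.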