Here it is: CUDA 4.0 RC has just been released to NVIDIA registered developers.
Interesting stuff from the CUDA manual:
- Layered Textures Support (GL_TEXTURE_1D/2D_ARRAY): New tex.a1d/.a2d modifiers in PTX. But unfortunately the surface instructions do not support them yet, Grrrr
Layered textures are created using cudaMalloc3DArray() with the cudaArrayLayered flag. There are new cudaTextureType1DLayered/cudaTextureType2DLayered texture sampler types and tex1DLayered()/tex2DLayered() access intrinsics.
- New .address_size PTX specifier: Lets you specify the address size (32-bit/64-bit) used throughout a PTX module.
- Inline PTX assembly: This feature has been present since CUDA 2.x but was not officially supported. It's now fully supported and documented :-D
- Driver API: new thread-safe, stateless launch function cuLaunchKernel(): cuLaunchKernel(kernelObj, blocksPerGrid, 1, 1, threadsPerBlock, 1, 1, 0, 0, args, 0);
- Fermi ISA documented and supported by cuobjdump.
- Enhanced C++: Support for operators new and delete, virtual functions.
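
For the layered textures item, here is a rough sketch of the whole path, from cudaMalloc3DArray() with cudaArrayLayered to a tex2DLayered() fetch. Big caveat: it uses the texture-object API (cudaCreateTextureObject), which only arrived in later toolkits, rather than the texture-reference style that 4.0 ships with, because texture references have since been removed from the toolkit. The kernel name readLayer, the sizes and the test data are all made up for illustration.

```cuda
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread fetches one texel from the given layer.
__global__ void readLayer(cudaTextureObject_t tex, int width, int height,
                          int layer, float *out)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Unnormalized coordinates; +0.5f addresses the texel center.
        out[y * width + x] = tex2DLayered<float>(tex, x + 0.5f, y + 0.5f, layer);
    }
}

int main()
{
    const int W = 4, H = 4, LAYERS = 3, LAYER_TO_READ = 2;

    // Fill each layer with distinct values: 100 * layer + texel index.
    std::vector<float> host(W * H * LAYERS);
    for (int l = 0; l < LAYERS; ++l)
        for (int i = 0; i < W * H; ++i)
            host[l * W * H + i] = 100.0f * l + i;

    // A layered 2D array is a 3D allocation whose depth is the layer count.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaExtent extent = make_cudaExtent(W, H, LAYERS);
    cudaArray_t array = NULL;
    cudaMalloc3DArray(&array, &desc, extent, cudaArrayLayered);

    // Upload all layers with a single 3D copy (extent is in elements here).
    cudaMemcpy3DParms copy = {0};
    copy.srcPtr   = make_cudaPitchedPtr(host.data(), W * sizeof(float), W, H);
    copy.dstArray = array;
    copy.extent   = extent;
    copy.kind     = cudaMemcpyHostToDevice;
    cudaMemcpy3D(&copy);

    // Texture object setup (later-toolkit API, see the caveat above).
    cudaResourceDesc res;
    memset(&res, 0, sizeof(res));
    res.resType = cudaResourceTypeArray;
    res.res.array.array = array;

    cudaTextureDesc td;
    memset(&td, 0, sizeof(td));
    td.addressMode[0]   = cudaAddressModeClamp;
    td.addressMode[1]   = cudaAddressModeClamp;
    td.filterMode       = cudaFilterModePoint;
    td.readMode         = cudaReadModeElementType;
    td.normalizedCoords = 0;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, NULL);

    float *dOut = NULL;
    cudaMalloc(&dOut, W * H * sizeof(float));
    readLayer<<<dim3(1, 1), dim3(W, H)>>>(tex, W, H, LAYER_TO_READ, dOut);

    std::vector<float> result(W * H);
    cudaMemcpy(result.data(), dOut, W * H * sizeof(float), cudaMemcpyDeviceToHost);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("layer %d: first texel = %.1f, last texel = %.1f\n",
           LAYER_TO_READ, result[0], result[W * H - 1]);

    cudaDestroyTextureObject(tex);
    cudaFree(dOut);
    cudaFreeArray(array);
    return 0;
}
```

Each layer is addressed with an integer index and there is no filtering between layers, which is the point of layered textures compared to a plain 3D texture.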
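
To put the cuLaunchKernel() call and the .address_size specifier into context, here is a minimal driver API sketch that loads a tiny hand-written PTX module and launches it. Everything around the launch call is my own assumption: the kernel name fill_index, the PTX body, and the .version/.target lines (picked to match this release, they may need adjusting for your driver). It also assumes a 64-bit host, since the module declares .address_size 64, and it uses the primary-context calls from later toolkits to get a context.

```cuda
#include <cstdio>
#include <cuda.h>

// Hand-written PTX (illustrative): out[i] = i for every i < n.
static const char *kPtx =
    ".version 2.3\n"
    ".target sm_20\n"
    ".address_size 64\n"
    ".visible .entry fill_index(.param .u64 p_out, .param .u32 p_n)\n"
    "{\n"
    "    .reg .pred %p<2>;\n"
    "    .reg .u32  %r<6>;\n"
    "    .reg .u64  %rd<4>;\n"
    "    ld.param.u64 %rd1, [p_out];\n"
    "    ld.param.u32 %r1, [p_n];\n"
    "    mov.u32 %r2, %ctaid.x;\n"
    "    mov.u32 %r3, %ntid.x;\n"
    "    mov.u32 %r4, %tid.x;\n"
    "    mad.lo.u32 %r5, %r2, %r3, %r4;\n"
    "    setp.ge.u32 %p1, %r5, %r1;\n"
    "    @%p1 bra DONE;\n"
    "    mul.wide.u32 %rd2, %r5, 4;\n"
    "    add.u64 %rd3, %rd1, %rd2;\n"
    "    st.global.u32 [%rd3], %r5;\n"
    "DONE:\n"
    "    ret;\n"
    "}\n";

// Bail out of main() on the first failing driver call.
#define CHECK(call)                                                   \
    do {                                                              \
        CUresult r_ = (call);                                         \
        if (r_ != CUDA_SUCCESS) {                                     \
            fprintf(stderr, "%s failed with error %d\n", #call, (int)r_); \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main()
{
    const unsigned int n = 256;
    const unsigned int threadsPerBlock = 128;
    const unsigned int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction kernelObj;
    CUdeviceptr dOut;
    unsigned int host[n];

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuDevicePrimaryCtxRetain(&ctx, dev));
    CHECK(cuCtxSetCurrent(ctx));

    // The driver JIT-compiles the PTX text for the current device.
    CHECK(cuModuleLoadData(&mod, kPtx));
    CHECK(cuModuleGetFunction(&kernelObj, mod, "fill_index"));
    CHECK(cuMemAlloc(&dOut, n * sizeof(unsigned int)));

    // Kernel parameters are passed as an array of pointers to each argument.
    unsigned int count = n;
    void *args[] = { &dOut, &count };
    CHECK(cuLaunchKernel(kernelObj, blocksPerGrid, 1, 1, threadsPerBlock, 1, 1,
                         0, 0, args, 0));
    CHECK(cuCtxSynchronize());

    CHECK(cuMemcpyDtoH(host, dOut, n * sizeof(unsigned int)));
    printf("host[0] = %u, host[%u] = %u\n", host[0], n - 1, host[n - 1]);

    CHECK(cuMemFree(dOut));
    CHECK(cuModuleUnload(mod));
    CHECK(cuDevicePrimaryCtxRelease(dev));
    return 0;
}
```

Build it with something like nvcc file.cu -lcuda. The nice part is that grid size, block size, shared memory size, stream and arguments all travel in the one call, so nothing is left behind as per-function state between launches.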
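
The inline PTX and enhanced C++ items fit together in one small kernel. This is only a sketch: the Shape/Square/Circle classes and the laneId() helper are invented for the example, and it needs a Fermi-class (sm_20) or newer GPU. Note that the objects are created on the device, since an object with virtual functions has to be constructed on the side that calls them.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative class hierarchy with virtual functions, used in device code.
struct Shape {
    __device__ virtual float area() const = 0;
    __device__ virtual ~Shape() {}
};

struct Square : Shape {
    float side;
    __device__ explicit Square(float s) : side(s) {}
    __device__ virtual float area() const { return side * side; }
};

struct Circle : Shape {
    float radius;
    __device__ explicit Circle(float r) : radius(r) {}
    __device__ virtual float area() const { return 3.14159265f * radius * radius; }
};

// Inline PTX: read the %laneid special register (thread index within the warp).
__device__ unsigned int laneId()
{
    unsigned int lane;
    asm("mov.u32 %0, %%laneid;" : "=r"(lane));
    return lane;
}

__global__ void demo(float *areas, unsigned int *lanes)
{
    int i = threadIdx.x;

    // Device-side operator new: even threads build a Square, odd ones a Circle.
    Shape *s = NULL;
    if (i % 2 == 0)
        s = new Square(2.0f);
    else
        s = new Circle(1.0f);

    // new returns NULL if the device heap is exhausted.
    if (s != NULL) {
        areas[i] = s->area();   // virtual dispatch on the GPU
        delete s;               // device-side operator delete
    } else {
        areas[i] = -1.0f;
    }
    lanes[i] = laneId();
}

int main()
{
    const int N = 8;
    float *dAreas = NULL;
    unsigned int *dLanes = NULL;
    cudaMalloc(&dAreas, N * sizeof(float));
    cudaMalloc(&dLanes, N * sizeof(unsigned int));

    demo<<<1, N>>>(dAreas, dLanes);

    float areas[N];
    unsigned int lanes[N];
    cudaMemcpy(areas, dAreas, sizeof(areas), cudaMemcpyDeviceToHost);
    cudaMemcpy(lanes, dLanes, sizeof(lanes), cudaMemcpyDeviceToHost);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < N; ++i)
        printf("thread %d: lane %u, area %.3f\n", i, lanes[i], areas[i]);

    cudaFree(dAreas);
    cudaFree(dLanes);
    return 0;
}
```

In the asm statement the doubled %% is how you write a literal % for a PTX special register, while %0 refers to the first operand, here the "=r" (32-bit register) output.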