NVIDIA released a beta version of the CUDA 3.1 toolkit for registered developers.
New features from the programming guide:
- 16-bit float textures supported by the runtime API. __float2half_rn() and __half2float() intrinsics added (Table C-3).
- Surface memory interface exposed in the runtime API (Sections 3.2.5, B.9): read/write access to textures (CUDA arrays), though still limited to 1D and 2D arrays for now.
- Up to 16 concurrent kernel launches on Fermi (up from 4 in CUDA 3.0). Not sure how it is actually implemented (one per SM? multiple per SM?).
- Recursive calls supported in device functions on Fermi (B.1.4). Stack size query and setting functions added (cudaThreadGetLimit(), cudaThreadSetLimit()).
- Function pointers to device functions supported on Fermi (B.1.4); function pointers to global functions supported on all GPUs.
- Just noticed that a __CUDA_ARCH__ macro, which allows writing different code paths depending on the target architecture (or for code executed on the host), has been there since CUDA 3.0 (B.1.4).
- printf support in kernels integrated into the API for sm_20 (B.14). Note that a cuprintf supporting all architectures was provided to registered developers a few months ago.
- New __byte_perm(x,y,s) intrinsic (C.2.3).
- New __forceinline__ function qualifier to force inlining on Fermi. A __noinline__ qualifier was already present to force an actual function call on sm_1x.
- New -dlcm compilation flag to specify the global memory caching strategy on Fermi (G.4.2).
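To illustrate the new recursion support and the stack-size limit functions, here is a minimal sketch (the factorial kernel and the 4 KB stack value are my own illustration, not from the guide); it should build with nvcc -arch=sm_20:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Recursive device function: only legal on Fermi (sm_20).
__device__ int factorial(int n)
{
    return (n <= 1) ? 1 : n * factorial(n - 1);
}

__global__ void kernel(int *out)
{
    *out = factorial(10);
}

int main()
{
    // Grow the per-thread stack for deep recursion, then read it back
    // with the matching query function.
    cudaThreadSetLimit(cudaLimitStackSize, 4096);
    size_t stack = 0;
    cudaThreadGetLimit(&stack, cudaLimitStackSize);
    printf("stack size per thread: %lu bytes\n", (unsigned long)stack);

    int *d_out, h_out = 0;
    cudaMalloc((void**)&d_out, sizeof(int));
    kernel<<<1, 1>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("10! = %d\n", h_out);
    cudaFree(d_out);
    return 0;
}
```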
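The __CUDA_ARCH__ macro and the sm_20 printf combine naturally: a guarded code path lets the same kernel compile for older targets. A small sketch (kernel name is mine):

```cuda
#include <cstdio>

__global__ void hello()
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 200
    // sm_20 path: device-side printf is available (B.14).
    printf("hello from thread %d\n", threadIdx.x);
#else
    // sm_1x path: no device-side printf; fall back to nothing
    // (or to cuprintf if you have it).
#endif
}

int main()
{
    hello<<<1, 4>>>();
    cudaThreadSynchronize();  // flush the device printf buffer
    return 0;
}
```

__CUDA_ARCH__ is defined as the target's compute capability times 100 (200 for sm_20) during device compilation, and is undefined on the host path, which is what makes host/device code splits possible.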
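As a quick sketch of what __byte_perm() buys you: each nibble of the selector s picks one byte out of the pair {y, x}, so a selector of 0x0123 reverses the bytes of x, giving a one-instruction byte swap (the kernel below is my own example):

```cuda
#include <cstdio>

__global__ void bswap(unsigned int *v)
{
    // Nibble i of the selector chooses the byte for result byte i;
    // bytes 0-3 come from x, 4-7 from y (unused here).
    *v = __byte_perm(*v, 0, 0x0123);
}

int main()
{
    unsigned int h = 0x11223344, *d;
    cudaMalloc((void**)&d, sizeof(h));
    cudaMemcpy(d, &h, sizeof(h), cudaMemcpyHostToDevice);
    bswap<<<1, 1>>>(d);
    cudaMemcpy(&h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("0x%08x\n", h);  // byte-reversed: 0x44332211
    cudaFree(d);
    return 0;
}
```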
Interesting new stuff in the Fermi Compatibility Guide:
- Just-in-time kernel compilation can be used with the runtime API with R195 drivers (Section 1.2.1).
- Details on using the volatile keyword for intra-warp communication (Section 1.2.2).
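The volatile point matters for the classic warp-synchronous reduction pattern; a minimal sketch of the idea (function name is mine):

```cuda
// Threads of a warp execute in lockstep, so the last reduction steps
// need no __syncthreads(); but the shared array must be volatile so
// the compiler re-reads it from shared memory at every step instead
// of caching partial sums in registers.
__device__ void warpReduce(volatile float *sdata, int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}
```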
Interesting new stuff in the Best Practice Guide:
- Use signed integers instead of unsigned as loop counters. This allows the compiler to perform strength reduction and can provide better performance (Section 6.3).
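The reason is that signed overflow is undefined behavior, so the compiler may assume the counter never wraps and replace the multiply in the index with an incremented pointer; unsigned wrap-around is well defined and blocks that. A sketch of the pattern (kernel name is mine):

```cuda
// Signed counter: the compiler may strength-reduce i * stride
// into an index that is simply incremented each iteration.
__global__ void copyStrided(float *out, const float *in, int n, int stride)
{
    for (int i = 0; i < n; i++)        // signed: wrap assumed impossible
        out[i] = in[i * stride];
}
// With "unsigned int i", defined wrap-around semantics can prevent
// the same optimization.
```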