NVIDIA released a beta version of the CUDA 3.1 toolkit for registered developers.

New features from the programming guide (quick code sketches follow the list):

  • 16-bit float textures supported by the runtime API. __float2half_rn() and __half2float() intrinsics added (Table C-3).
  • Surface memory interface exposed in the runtime API (Sections 3.2.5, B.9): read/write access into CUDA arrays. Still limited to 1D and 2D arrays for now.
  • Up to 16 concurrent kernel launches on Fermi (it was only 4 in CUDA 3.0). Not sure how it is actually implemented (one per SM? multiple per SM?).
  • Recursive calls supported in device functions on Fermi (B.1.4). Stack-size query and setting functions added (cudaThreadGetLimit(), cudaThreadSetLimit()).
  • Function pointers to device functions supported on Fermi (B.1.4). Function pointers to global functions supported on all GPUs.
  • Just noticed that the __CUDA_ARCH__ macro, which allows writing different code paths depending on the target architecture (or for code executed on the host), has been there since CUDA 3.0 (B.1.4).
  • printf support in kernels integrated into the API for sm_20 (B.14). Note that a cuprintf supporting all architectures was provided to registered developers a few months ago.
  • New __byte_perm(x,y,s) intrinsic (C.2.3).
  • New __forceinline__ function qualifier to force inlining on Fermi. A __noinline__ qualifier was already available to force an actual function call on sm_1.x.
  • New -dlcm compilation flag to specify the global memory caching strategy on Fermi (G.4.2).
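
Below are a few quick, untested sketches of how these features look in code, based on my reading of the guide (kernel names, launch configurations and host-side glue are my own assumptions). First, the half-float conversion intrinsics: in this toolkit the 16-bit values are carried as plain unsigned short, and both intrinsics are device-only.

```
// Pack a float array to 16-bit half storage and back.
__global__ void float_to_half(const float *in, unsigned short *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __float2half_rn(in[i]);   // round-to-nearest-even conversion
}

__global__ void half_to_float(const unsigned short *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __half2float(in[i]);
}
```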
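
A sketch of the new surface read/write path in the runtime API. Note that the x coordinate passed to surf2Dread()/surf2Dwrite() is in bytes, and the CUDA array has to be created with the cudaArraySurfaceLoadStore flag (host-side details from memory, to be double-checked).

```
surface<void, 2> surfRef;                       // file-scope surface reference

__global__ void invert(int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float v;
        surf2Dread(&v, surfRef, x * sizeof(float), y);   // read from the array
        surf2Dwrite(1.0f - v, surfRef, x * sizeof(float), y); // write it back
    }
}

// Host side (error checking omitted):
//   cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
//   cudaArray *arr;
//   cudaMallocArray(&arr, &desc, width, height, cudaArraySurfaceLoadStore);
//   cudaBindSurfaceToArray(surfRef, arr);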
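
Concurrent kernel execution is exposed through streams: kernels issued in different non-default streams are the candidates for overlapping on Fermi. A minimal sketch, assuming the data has already been split into 16 chunks on the device:

```
__global__ void busy(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = sqrtf(data[i]);
}

void launch_concurrent(float *d_data, int chunk)
{
    cudaStream_t streams[16];
    for (int i = 0; i < 16; ++i)
        cudaStreamCreate(&streams[i]);

    // Kernels in distinct streams may run concurrently on sm_20 hardware.
    for (int i = 0; i < 16; ++i)
        busy<<<chunk / 256, 256, 0, streams[i]>>>(d_data + i * chunk, chunk);

    cudaThreadSynchronize();
    for (int i = 0; i < 16; ++i)
        cudaStreamDestroy(streams[i]);
}
```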
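
Device-side recursion on sm_20, with the per-thread stack enlarged through the new limit functions (the 4 KB figure below is an arbitrary guess, not a recommendation):

```
__device__ int fib(int n)
{
    return (n < 2) ? n : fib(n - 1) + fib(n - 2);   // real recursion, sm_20 only
}

__global__ void fib_kernel(int n, int *out)
{
    *out = fib(n);
}

// Host side:
//   size_t stackSize;
//   cudaThreadSetLimit(cudaLimitStackSize, 4096);        // bytes per thread
//   cudaThreadGetLimit(&stackSize, cudaLimitStackSize);  // read it back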
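
Function pointers to __device__ functions, taken and called from device code (sm_20 only); a hypothetical example selecting an operator at run time:

```
typedef float (*binop_t)(float, float);

__device__ float add_op(float a, float b) { return a + b; }
__device__ float mul_op(float a, float b) { return a * b; }

__global__ void apply(const float *x, const float *y, float *out, int n, int which)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    binop_t op = which ? mul_op : add_op;   // pointer taken on the device
    if (i < n)
        out[i] = op(x[i], y[i]);
}
```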
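
The __CUDA_ARCH__ macro selects a code path per target architecture at compile time, and is left undefined when the same function is compiled for the host:

```
__host__ __device__ float fast_exp(float x)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 200
    return expf(x);        // sm_20 path
#elif defined(__CUDA_ARCH__)
    return __expf(x);      // fast intrinsic on sm_1.x
#else
    return expf(x);        // host compilation: __CUDA_ARCH__ is undefined
#endif
}
```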
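
In-kernel printf on sm_20; the device-side buffer is flushed at synchronization points:

```
#include <cstdio>

__global__ void hello()
{
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main()
{
    hello<<<2, 4>>>();          // build with -arch=sm_20
    cudaThreadSynchronize();    // flushes the device-side printf buffer
    return 0;
}
```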
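
__byte_perm(x, y, s) builds a 32-bit word out of any 4 of the 8 bytes of x and y, one selector nibble per result byte. A classic use is a byte swap (endianness flip) of a single word:

```
__global__ void bswap(const unsigned int *in, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __byte_perm(in[i], 0, 0x0123);  // selects bytes 3,2,1,0 of x
}
```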
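
And the two inlining qualifiers on __device__ functions:

```
__device__ __forceinline__ float lerp(float a, float b, float t)
{
    return a + t * (b - a);     // force inlining, even where sm_20 would not
}

__device__ __noinline__ float big_helper(float x)
{
    // large body kept as a real function call to limit code size
    return x * x;
}
```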

Interesting new stuff in the Fermi Compatibility Guide:
  • Just-in-time kernel compilation can be used with the runtime API with R195 drivers (Section 1.2.1).
  • Details on using the volatile keyword for intra-warp communication (Section 1.2.2); see the sketch after this list.
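
The point is that without volatile the compiler may keep shared-memory values in registers, which breaks the usual warp-synchronous tricks on Fermi. A sketch of the classic last-warp reduction, assuming 256 threads per block:

```
__global__ void reduce(const float *in, float *out, int n)
{
    __shared__ float sdata[256];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Regular tree reduction down to one warp.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Last warp: no __syncthreads(), so the volatile view is what forces the
    // compiler to re-read shared memory after each step.
    if (tid < 32) {
        volatile float *v = sdata;
        v[tid] += v[tid + 32];
        v[tid] += v[tid + 16];
        v[tid] += v[tid + 8];
        v[tid] += v[tid + 4];
        v[tid] += v[tid + 2];
        v[tid] += v[tid + 1];
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}
```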

Interesting new stuff in the Best Practice Guide:
  • Use signed integers instead of unsigned as loop counters: this allows the compiler to perform strength reduction and can give better performance (Section 6.3). A sketch follows below.
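
The reasoning is that signed overflow is undefined, so the compiler may assume the counter never wraps and can strength-reduce the indexing, while unsigned wrap-around semantics prevent that. A small illustrative kernel:

```
__global__ void scale_rows(float *data, int stride, int n, float s)
{
    // Signed counter: i * stride can be strength-reduced to an incremented
    // pointer; with an unsigned i the compiler has to honor wrap-around.
    for (int i = 0; i < n; ++i)
        data[i * stride + threadIdx.x] *= s;
}
```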