NVIDIA released a beta version of the CUDA 3.1 toolkit for registered developers.
New features from the programming guide:
- 16-bit float textures supported by the runtime API. __float2half_rn() and __half2float() intrinsics added (Table C-3).
- Surface memory interface exposed in the runtime API (Sections 3.2.5, B.9): read/write access to textures (CUDA arrays), though still limited to 1D and 2D arrays for now.
- Up to 16 concurrent kernel launches on Fermi (up from 4 in CUDA 3.0). Not sure how it is actually implemented (one per SM? multiple per SM?).
- Recursive calls supported in device functions on Fermi (B.1.4). Stack size query and setting functions added (cudaThreadGetLimit(), cudaThreadSetLimit()).
- Function pointers to device functions supported on Fermi (B.1.4); function pointers to global functions supported on all GPUs.
- Just noticed that a __CUDA_ARCH__ macro, which allows writing different code paths depending on the target architecture (or for code executed on the host), has been there since CUDA 3.0 (B.1.4).
- printf support in kernels integrated into the API for sm_20 (B.14). Note that a cuprintf supporting all architectures was provided to registered developers a few months ago.
- New __byte_perm(x,y,s) intrinsic (C.2.3).
- New __forceinline__ function qualifier to force inlining on Fermi. A __noinline__ qualifier was already present to force an actual function call on sm_1x.
- New -dlcm compilation flag to specify the global memory caching strategy on Fermi (G.4.2).
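To illustrate the new recursion support and the stack-size limit functions, here is a minimal sketch (the factorial kernel and the 4 KB stack value are my own illustration, not from the guide); it should build with nvcc -arch=sm_20:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Recursive device function: only legal on Fermi (sm_20).
__device__ int factorial(int n)
{
    return (n <= 1) ? 1 : n * factorial(n - 1);
}

__global__ void kernel(int *out)
{
    *out = factorial(10);
}

int main()
{
    // Grow the per-thread stack for deep recursion, then read it back
    // with the matching query function.
    cudaThreadSetLimit(cudaLimitStackSize, 4096);
    size_t stack = 0;
    cudaThreadGetLimit(&stack, cudaLimitStackSize);
    printf("stack size per thread: %lu bytes\n", (unsigned long)stack);

    int *d_out, h_out = 0;
    cudaMalloc((void**)&d_out, sizeof(int));
    kernel<<<1, 1>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("10! = %d\n", h_out);
    cudaFree(d_out);
    return 0;
}
```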
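The __CUDA_ARCH__ macro and the sm_20 printf combine naturally: a guarded code path lets the same kernel compile for older targets. A small sketch (kernel name is mine):

```cuda
#include <cstdio>

__global__ void hello()
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 200
    // sm_20 path: device-side printf is available (B.14).
    printf("hello from thread %d\n", threadIdx.x);
#else
    // sm_1x path: no device-side printf; fall back to nothing
    // (or to cuprintf if you have it).
#endif
}

int main()
{
    hello<<<1, 4>>>();
    cudaThreadSynchronize();  // flush the device printf buffer
    return 0;
}
```

__CUDA_ARCH__ is defined as the target's compute capability times 100 (200 for sm_20) during device compilation, and is undefined on the host path, which is what makes host/device code splits possible.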
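As a quick sketch of what __byte_perm() buys you: each nibble of the selector s picks one byte out of the pair {y, x}, so a selector of 0x0123 reverses the bytes of x, giving a one-instruction byte swap (the kernel below is my own example):

```cuda
#include <cstdio>

__global__ void bswap(unsigned int *v)
{
    // Nibble i of the selector chooses the byte for result byte i;
    // bytes 0-3 come from x, 4-7 from y (unused here).
    *v = __byte_perm(*v, 0, 0x0123);
}

int main()
{
    unsigned int h = 0x11223344, *d;
    cudaMalloc((void**)&d, sizeof(h));
    cudaMemcpy(d, &h, sizeof(h), cudaMemcpyHostToDevice);
    bswap<<<1, 1>>>(d);
    cudaMemcpy(&h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("0x%08x\n", h);  // byte-reversed: 0x44332211
    cudaFree(d);
    return 0;
}
```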
Interesting new stuff in the Fermi Compatibility Guide:
- Just-in-time kernel compilation can be used with the runtime API with R195 drivers (Section 1.2.1).
- Details on using the volatile keyword for intra-warp communication (Section 1.2.2).
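The volatile point matters for the classic warp-synchronous reduction pattern; a minimal sketch of the idea (function name is mine):

```cuda
// Threads of a warp execute in lockstep, so the last reduction steps
// need no __syncthreads(); but the shared array must be volatile so
// the compiler re-reads it from shared memory at every step instead
// of caching partial sums in registers.
__device__ void warpReduce(volatile float *sdata, int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}
```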
Interesting new stuff in the Best Practice Guide:
- Use signed integers instead of unsigned as loop counters. This allows the compiler to perform strength reduction and can provide better performance (Section 6.3).
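The reason is that signed overflow is undefined behavior, so the compiler may assume the counter never wraps and replace the multiply in the index with an incremented pointer; unsigned wrap-around is well defined and blocks that. A sketch of the pattern (kernel name is mine):

```cuda
// Signed counter: the compiler may strength-reduce i * stride
// into an index that is simply incremented each iteration.
__global__ void copyStrided(float *out, const float *in, int n, int stride)
{
    for (int i = 0; i < n; i++)        // signed: wrap assumed impossible
        out[i] = in[i * stride];
}
// With "unsigned int i", defined wrap-around semantics can prevent
// the same optimization.
```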