When optimizing data transfers in CUDA, using pinned (page-locked) memory is critical: the DMA engines can only operate on page-locked buffers, so transfers from pageable memory go through an extra staging copy. One way to obtain such pinned memory is to ask CUDA to allocate host memory with the cudaMallocHost function.
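As a quick sketch of the usual pattern (the buffer names and sizes here are purely illustrative), pinned host memory is what makes cudaMemcpyAsync genuinely asynchronous with respect to the host:

```c
#include <cuda_runtime.h>

int main(void)
{
    const size_t n = 1 << 20;
    float *host_buf, *dev_buf;
    cudaStream_t stream;

    /* Pinned (page-locked) host allocation: required for truly
     * asynchronous host<->device copies. */
    cudaMallocHost(&host_buf, n * sizeof(float));
    cudaMalloc(&dev_buf, n * sizeof(float));
    cudaStreamCreate(&stream);

    /* With a pinned source buffer, this copy returns immediately and
     * can overlap with host work; with pageable memory it would not. */
    cudaMemcpyAsync(dev_buf, host_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    /* ... independent host work here ... */
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(dev_buf);
    cudaFreeHost(host_buf);
    return 0;
}
```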
With the UVA (Unified Virtual Addressing) mechanism introduced in CUDA 4.0 (available on 64-bit platforms with devices of compute capability 2.0 or higher), there is an additional behaviour worth mentioning: pinned memory allocated with cudaMallocHost is not only a valid host buffer, it is also mapped into the device's address space. This means that CUDA kernels can read or write it directly over the PCIe bus, using the very same host pointer.
For example, the following piece of code shows a kernel that increments a variable mapped in host memory. Note that we use atomic add operations because many threads update the variable concurrently: with plain loads and stores, the high latency of accesses to main memory makes it all the more likely that several CUDA threads read the same value and lose updates.
Of course, this trivial example is by no means efficient code; it simply illustrates that we do not always need to issue a costly pair of cudaMemcpy(Async) operations from the host when accessing very few elements, or when the data access pattern is very irregular.
#include <assert.h>
#include <stdio.h>

#define N 4096

static __global__ void inc_kernel(unsigned long long *cnt)
{
    /* cnt is a host address: this atomic goes over the PCIe bus */
    atomicAdd(cnt, 1ULL);
}

int main(int argc, char **argv)
{
    unsigned long long *cnt;
    cudaMallocHost(&cnt, sizeof(unsigned long long));
    *cnt = 0;
    for (int i = 0; i < N; i++)
        inc_kernel<<<1, 16>>>(cnt); /* 16 threads, matching the assertion */
    cudaDeviceSynchronize(); /* wait for all kernels before the host reads cnt */
    assert(*cnt == 16*N);
    fprintf(stderr, "CNT %llu\n", *cnt);
    cudaFreeHost(cnt);
    return 0;
}
NB: the previous code must be compiled with the -arch=sm_20 flag (or a higher architecture) so that the 64-bit atomicAdd intrinsic is defined.
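For reference, a typical build and run sequence looks as follows (the file name inc.cu is illustrative):

```shell
nvcc -arch=sm_20 -o inc inc.cu
./inc
```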