Accessing pinned host memory directly from the device

When it comes to optimizing data transfers, ensuring that we use pinned memory is critical in CUDA. One way to use such pinned memory is to ask CUDA to allocate host memory with the cudaMallocHost function.

With the UVA (Unified Virtual Addressing) mechanism added in CUDA 4.0, there is an additional behaviour that is worth mentioning: pinned memory allocated with cudaMallocHost is not only a valid host buffer, but it is also mapped in devices' memory. This means that CUDA kernels can directly read or write through the PCI bus using the host address.

For example, the following piece of code shows a kernel that increments a variable mapped in host memory. Note that we use atomic add operations because the latency to access the variable in main memory is so high that many CUDA threads are likely to read the same value concurrently.

Of course this trivial example is by no mean an efficient code, it simply illustrates that we need not always issue a costly pair of cudaMemcpy(Async) operations from the host when accessing very little elements, or with very irregular data access patterns.

C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
#include <stdio.h>
#include <assert.h>
#define N 4096
static __global__ void inc_kernel(unsigned long long *cnt)
{
atomicAdd(cnt, 1);
}
int main(int argc, char **argv)
{
unsigned long long *cnt;
cudaMallocHost(&amp;cnt, sizeof(unsigned long long));
*cnt = 0;
int i;
for (i = 0; i &lt; N; i++)
inc_kernel<<<4, 4>>>(cnt);
cudaThreadSynchronize();
assert(*cnt == 16*N);
fprintf(stderr, "CNT %lu\n", *cnt);
return 0;
}

NB: the previous code must be compiled with the -arch=sm_20 flag (or higher) to have the atomicAdd function defined.

One Response to “Accessing pinned host memory directly from the device”

  1. Martin juillet 16, 2013 at 9:32 #

    Hi Cedric,

    I have tested your code [1] but the cnt is not modified by the kernels.
    The pinned memory variable cnt stays 0!

    $ ./accessing_pinned_memory
    CNT 0
    Assertion failed: (*cnt == 16*N), function main, file accessing_pinned_memory.cu, line 40.
    Abort trap: 6

    I hope, I can get this example working!

    Thanks!

    Martin

    [1] https://github.com/millecker/applications/blob/master/CUDA/PinnedMemory/AccessingPinnedMemory/accessing_pinned_memory.cu