CUDA kernel launches in the null stream are NOT synchronous

Today I'd like to point out a very common error found in CUDA code that assumes that calling a kernel in stream 0 (either explicitly or implicitly) results in a synchronous kernel call.

Prior to the introduction of streams in CUDA, users did not have to care about synchronization issues, as everything would typically get serialized. The introduction of streams made it possible to write asynchronous code, but the null stream was kept for the sake of backward compatibility, and to give end users an easy way to write synchronous code without explicitly dealing with streams.

The semantics of the null stream imply that CUDA implicitly adds a barrier: any command enqueued in the null stream (kernels, memory transfers, etc.) waits for all previous CUDA activity (in all other streams), and subsequent activity in other streams waits for it in turn.
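As an illustration, consider the following sketch (kernel_a, kernel_b and kernel_c are hypothetical kernels): kernel_b will not start before kernel_a has completed, and kernel_c will not start before kernel_b has completed, even though kernel_a and kernel_c were submitted to a dedicated stream.

cudaStream_t stream1;
cudaStreamCreate(&stream1);

kernel_a<<<256, 256, 0, stream1>>>(); /* runs first */
kernel_b<<<256, 256>>>();             /* null stream: waits for kernel_a */
kernel_c<<<256, 256, 0, stream1>>>(); /* starts only after kernel_b */

cudaStreamDestroy(stream1);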

As a result, one very common belief is that launching a kernel in the null stream results in a synchronous kernel call, i.e. that the kernel has completed by the time the launch call returns.

Here is a typical example of code one might write, implicitly using the null stream:

cudaMemcpy(d_var, &h_var, sizeof(h_var), cudaMemcpyHostToDevice);
cuda_kernel<<<256, 256>>>(d_var);
cudaMemcpy(&h_var, d_var, sizeof(h_var), cudaMemcpyDeviceToHost);
cpu_function(h_var);

Likewise, one can enqueue multiple kernels which are implicitly serialized:

cudaMemcpy(d_var, &h_var, sizeof(h_var), cudaMemcpyHostToDevice);
cuda_kernel_1<<<256, 256>>>(d_var);
cuda_kernel_2<<<256, 256>>>(d_var);
cuda_kernel_3<<<256, 256>>>(d_var);
cudaMemcpy(&h_var, d_var, sizeof(h_var), cudaMemcpyDeviceToHost);
cpu_function(h_var);
cpu_function(h_var);

What makes the previous codes valid is the actual behavior of cudaMemcpy, which is guaranteed to be synchronous with respect to the host:

For transfers from device to either pageable or pinned host memory, the function returns only once the copy has completed.

As a result, it is very tempting to write the following piece of code, assuming that cpu_function will only execute once cuda_kernel has ended:

cuda_kernel<<<256, 256>>>();
cpu_function();

Even though this code would work 99% of the time, programmers must realize that it is wrong. There is indeed one detail of the programming API which most CUDA programmers (me included) tend to forget:

Kernel launches are asynchronous with respect to the host.

This is true regardless of whether the null stream is used or not: in the previous code example, the execution of cpu_function is therefore absolutely not guaranteed to start after the completion of the CUDA kernel (it is even unlikely to if the kernel runs for a long time).
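One way to make the previous snippet correct is to insert an explicit synchronization point between the kernel launch and the host call, for instance:

cuda_kernel<<<256, 256>>>();
cudaDeviceSynchronize(); /* block the host until all previously submitted device work has completed */
cpu_function();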

In most situations, users never realize that their code is wrong, because the kernel's side effects are typically reflected to the host through explicit memory transfers. These either rely on the synchronous behavior of cudaMemcpy with respect to the host, or on explicit synchronization mechanisms from the stream API (e.g. combinations of cudaMemcpyAsync and cudaStreamSynchronize).
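For instance, the stream-based variant of the first example could look like the following sketch (d_var, h_var, cuda_kernel and cpu_function are the same hypothetical names as before):

cudaStream_t stream;
cudaStreamCreate(&stream);

/* For truly asynchronous copies, h_var should reside in pinned memory */
cudaMemcpyAsync(d_var, &h_var, sizeof(h_var), cudaMemcpyHostToDevice, stream);
cuda_kernel<<<256, 256, 0, stream>>>(d_var);
cudaMemcpyAsync(&h_var, d_var, sizeof(h_var), cudaMemcpyDeviceToHost, stream);

/* Block the host until everything enqueued in the stream has completed */
cudaStreamSynchronize(stream);
cpu_function(h_var);

cudaStreamDestroy(stream);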

But let's now consider an example of a situation where CUDA will bite you if you do not take care of the actual semantics of the null stream. Zero-copy mapped memory allows CUDA kernels to directly write into host memory without any explicit data transfer that could have issued a barrier on the host under the hood. (For the sake of clarity, error checks were removed from the following piece of code.)

#include <stdio.h>

__global__ void kernel(int *var)
{
    *var = 42;
}

int main()
{
    /* Allow the device to map pinned host allocations */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    int *var;
    cudaHostAlloc(&var, sizeof(*var), cudaHostAllocMapped);
    *var = -1;

    int *mapped_var;
    cudaHostGetDevicePointer(&mapped_var, var, 0);

    kernel<<<1, 1>>>(mapped_var);

    printf("Value: %d\n", *var);

    return 0;
}

If the printf occurred after the kernel had completed, var would be 42, but the displayed value is actually -1! Calling cudaDeviceSynchronize() right after the kernel launch, however, changes the output of printf to 42, as one would usually expect given the null-stream semantics.
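Concretely, the fixed ending of main reads:

kernel<<<1, 1>>>(mapped_var);
cudaDeviceSynchronize(); /* make sure the kernel has written into var */
printf("Value: %d\n", *var);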

This example is of course very simplistic, but it illustrates why programmers should not assume that CUDA kernels launched in the null stream are synchronous. Such code behaves as expected in 99% of situations, but relying on this behavior may ultimately cause really nasty bugs (typically data corruption) when the same piece of CUDA code is used in a slightly different context.

Library designers should be especially cautious if they expect their CUDA algorithms to be usable from very different places. For example, nobody really knows what may happen when your incorrect CUDA library is called directly from the device using the nested parallelism feature (CUDA Dynamic Parallelism) introduced with the Kepler architecture. If you really care about portability and composability, this is something you should keep in mind when writing code that relies on the null stream.

One Response to “CUDA kernel launches in the null stream are NOT synchronous”

  1. Mike Pelton, January 22, 2013 at 1:49

    Wonderfully clear explanation Cedric and explains a bug I'm seeing. Merci Beaucoup!