Declaring dependencies with cudaStreamWaitEvent

cudaStreamWaitEvent is a very useful synchronization primitive which takes two arguments as input: a stream and an event. Even if this is not clear from its name, it is a non-blocking function: it returns immediately, and all operations enqueued in the stream after the call to cudaStreamWaitEvent will only start once the event has been triggered.
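
As a quick reminder, the prototype of this call looks roughly as follows (the last argument is a flags field, which is simply 0 in the examples below); the call itself returns immediately, and the actual wait is performed by the device when it reaches that point in the stream.

cudaError_t cudaStreamWaitEvent(cudaStream_t stream, cudaEvent_t event, unsigned int flags);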

A simple example

For example, in the following piece of code, kernels f and g are both enqueued in streamA. An event is recorded between the submission of the f and g CUDA kernels, and a dependency between this event and streamB is created before kernel h is enqueued in streamB. As a result, kernel h will not be executed until kernel f ends, but CUDA may execute kernels g and h concurrently. Note that this piece of code is totally non-blocking for the host.

cudaStream_t streamA, streamB;
cudaEvent_t event;

/* Initialize two streams and an event */
cudaStreamCreate(&streamA);
cudaStreamCreate(&streamB);
cudaEventCreate(&event);

/* Enqueue f in streamA and record the event right after it */
f<<<1, 1, 0, streamA>>>(arg_f);
cudaEventRecord(event, streamA);
g<<<1, 1, 0, streamA>>>(arg_g);

/* h will only start in streamB once the event (i.e. f) has completed */
cudaStreamWaitEvent(streamB, event, 0);
h<<<1, 1, 0, streamB>>>(arg_h);
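
Since nothing in this snippet blocks the host, the host eventually has to wait for the results explicitly. A minimal sketch of how this can be done on top of the code above:

/* Block the host until all work submitted to each stream has completed */
cudaStreamSynchronize(streamA);
cudaStreamSynchronize(streamB);

/* Note: cudaEventSynchronize(event) would only wait for f,
 * since the event was recorded right after f was submitted. */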

Fork-Join parallelism

Likewise, we can use this mechanism multiple times to create a Fork-Join parallel pattern. In the following example, a parent compute stream spawns N "children" streams which operate concurrently. Before submitting anything in a child stream, a dependency is created between that child and the parent. Once all work has been enqueued in a child stream, a dependency between this child stream and the parent stream is created in the other direction. As a result, this code implements a Fork-Join paradigm where the g kernels can only be executed after the completion of kernel f, and where kernel h can only be launched once all the g kernels are done.

int N = 16;
cudaStream_t parent_stream, child_streams[N];
cudaEvent_t parent_event, child_events[N];

/* Initialize the parent stream and enqueue some work in it */
cudaStreamCreate(&parent_stream);
f<<<1, 1, 0, parent_stream>>>(arg_f);

/* Create a synchronization point in the parent stream */
cudaEventCreate(&parent_event);
cudaEventRecord(parent_event, parent_stream);

/* Create N concurrent streams */
for (int i = 0; i < N; i++)
{
    cudaStreamCreate(&child_streams[i]);
    cudaStreamWaitEvent(child_streams[i], parent_event, 0);
    g<<<1, 1, 0, child_streams[i]>>>(arg_g);

    /* Create a synchronization point in the child stream */
    cudaEventCreate(&child_events[i]);
    cudaEventRecord(child_events[i], child_streams[i]);
    cudaStreamWaitEvent(parent_stream, child_events[i], 0);
}

/* Enqueue work in the parent stream again; this work will not be
 * executed until all work is done in the children. */
h<<<1, 1, 0, parent_stream>>>(arg_h);
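
As before, the host eventually has to wait for the work to finish, and the N child streams and events should be released once they are no longer needed. A minimal teardown sketch could look as follows:

/* Waiting for h implicitly waits for f and all the g kernels */
cudaStreamSynchronize(parent_stream);

/* Release the events and streams */
for (int i = 0; i < N; i++)
{
    cudaEventDestroy(child_events[i]);
    cudaStreamDestroy(child_streams[i]);
}
cudaEventDestroy(parent_event);
cudaStreamDestroy(parent_stream);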

Multi-GPU Fork-Join

Once again, this is a non-blocking piece of code, so all g kernels can be executed concurrently. Interestingly, this mechanism is also valid across contexts: we can create dependencies between streams located on different devices.

int ndevices;
cudaGetDeviceCount(&ndevices);
cudaStream_t parent_stream, child_streams[ndevices];
cudaEvent_t parent_event, child_events[ndevices];

/* Initialize the parent stream on device 0 and enqueue some work in it */
cudaSetDevice(0);
cudaStreamCreate(&parent_stream);
init_kernel<<<1, 1, 0, parent_stream>>>(arg_f);

/* Create a synchronization point in the parent stream */
cudaEventCreate(&parent_event);
cudaEventRecord(parent_event, parent_stream);

/* Create one concurrent stream per device */
for (int i = 0; i < ndevices; i++)
{
    /* Select the device on which to create the stream and which should
     * execute the kernel. */
    cudaSetDevice(i);
    cudaStreamCreate(&child_streams[i]);
    cudaStreamWaitEvent(child_streams[i], parent_event, 0);
    g<<<1, 1, 0, child_streams[i]>>>(arg_g);

    /* Create a synchronization point in the child stream */
    cudaEventCreate(&child_events[i]);
    cudaEventRecord(child_events[i], child_streams[i]);
    cudaStreamWaitEvent(parent_stream, child_events[i], 0);
}

/* Enqueue work in the parent stream again; this work will not be
 * executed until all work is done in the children. */
cudaSetDevice(0);
h<<<1, 1, 0, parent_stream>>>(arg_h);

Multi-GPU Fork-Join with concurrent kernels

On Fermi, one can also exploit multiple streams to execute kernels concurrently within the same device. Since all operations enqueued in a given stream are serialized, the only way to obtain concurrent kernels is indeed to maintain multiple streams and to enqueue kernels in different streams. This is for instance useful when the g kernel is not large enough to fully occupy the entire CUDA device on its own.

In the following example, we distribute possibly very small kernels over multiple devices, and we execute several kernels concurrently on each device that supports it (i.e. Fermi and later).

#define NKERNELS 1024
#define STREAMS_PER_DEVICE 8

int ndevices;
cudaGetDeviceCount(&ndevices);
int nstreams = ndevices * STREAMS_PER_DEVICE;
cudaStream_t parent_stream, child_streams[nstreams];
cudaEvent_t parent_event, child_events[nstreams];

/* Initialize the parent stream on device 0 and enqueue some work in it */
cudaSetDevice(0);
cudaStreamCreate(&parent_stream);
init_kernel<<<1, 1, 0, parent_stream>>>(arg_f);

/* Create a synchronization point in the parent stream */
cudaEventCreate(&parent_event);
cudaEventRecord(parent_event, parent_stream);

/* Create all streams and the initial dependency with f. */
for (int i = 0; i < nstreams; i++)
{
    int dev = i % ndevices;
    cudaSetDevice(dev);
    cudaStreamCreate(&child_streams[i]);
    cudaStreamWaitEvent(child_streams[i], parent_event, 0);
}

/* Dispatch the (numerous) kernels with a round-robin scheme.
 * child_streams[i % nstreams] was created on device (i % nstreams) % ndevices,
 * which equals i % ndevices since ndevices divides nstreams, so each kernel
 * is launched on the device that owns its stream. */
for (int i = 0; i < NKERNELS; i++)
{
    cudaSetDevice(i % ndevices);
    g<<<1, 1, 0, child_streams[i % nstreams]>>>(arg_g);
}

/* Synchronize children with the parent stream. */
for (int i = 0; i < nstreams; i++)
{
    int dev = i % ndevices;
    cudaSetDevice(dev);
    cudaEventCreate(&child_events[i]);
    cudaEventRecord(child_events[i], child_streams[i]);
    cudaStreamWaitEvent(parent_stream, child_events[i], 0);
}

/* Enqueue work in the parent stream again; this work will not be
 * executed until all work is done in the children. */
cudaSetDevice(0);
h<<<1, 1, 0, parent_stream>>>(arg_h);

Note that the previous piece of code is supposed to be fully asynchronous. In practice, however, careful performance analysis may show that the CUDA driver occasionally blocks the host when too many asynchronous operations are already pending. This does not affect correctness, and it is unlikely to affect performance either, because the CUDA devices keep running while the driver temporarily blocks the host.
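
If you want to check progress from the host without ever blocking, cudaStreamQuery can be used to poll a stream; a minimal sketch:

/* cudaStreamQuery returns cudaSuccess if all work in the stream has
 * completed, or cudaErrorNotReady if some operations are still pending. */
if (cudaStreamQuery(parent_stream) == cudaSuccess)
{
    /* Everything submitted to parent_stream (including h) is done. */
}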
