CUDA kernel launches in the null stream are NOT synchronous

Today I'd like to point out a very common error found in CUDA codes: assuming that launching a kernel in stream 0 (either explicitly or implicitly) results in a synchronous kernel call. Prior to the introduction of streams in CUDA, users did not have to care about synchronization issues, as everything would typically [...]
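To make the claim concrete, here is a minimal sketch (not taken from the original post) showing that a launch in stream 0 returns to the host before the kernel has finished. The kernel `slow_kernel`, the helper `launch_and_query` and the spin length are hypothetical names and values chosen for illustration only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: spins for roughly `cycles` clock ticks, then sets a flag.
__global__ void slow_kernel(int *flag, long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { /* busy wait */ }
    *flag = 1;
}

// Launches the kernel in the null stream and immediately asks whether
// the work queued in that stream has completed.
cudaError_t launch_and_query(int *d_flag)
{
    slow_kernel<<<1, 1>>>(d_flag, 100000000LL);  // implicit stream 0
    return cudaStreamQuery(0);                   // does not wait for the kernel
}

int main()
{
    int *d_flag = nullptr;
    cudaMalloc(&d_flag, sizeof(int));
    cudaMemset(d_flag, 0, sizeof(int));

    // Control comes back to the host while the kernel is typically still running,
    // in which case cudaStreamQuery reports cudaErrorNotReady.
    cudaError_t status = launch_and_query(d_flag);
    printf("right after launch: %s\n",
           status == cudaErrorNotReady ? "kernel still running"
                                       : "kernel already finished");

    // An explicit synchronization is what actually waits for the kernel.
    cudaStreamSynchronize(0);

    int h_flag = 0;
    cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
    printf("after synchronization: flag = %d\n", h_flag);

    cudaFree(d_flag);
    return 0;
}
```

Code that relies on the kernel's results from the host side should therefore synchronize explicitly (or go through a blocking call such as `cudaMemcpy`) rather than assume the launch itself waited.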

Declaring dependencies with cudaStreamWaitEvent

cudaStreamWaitEvent is a very useful synchronization primitive whose two main arguments are a stream and an event (it also takes a flags argument). Even if this is not clear from its name, the function does not block the host: it returns immediately, and all operations enqueued in the stream after the call to cudaStreamWaitEvent only start once the event has completed.

A simple example

For example, in [...]
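The following is a minimal sketch of such a dependency between two streams; it is not the original post's example. The kernels `produce` and `consume`, the helper `run_dependency_demo` and the stream and event names are hypothetical and chosen for illustration only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernels: one writes a value, the other reads it.
__global__ void produce(int *data) { *data = 42; }
__global__ void consume(const int *data, int *out) { *out = *data + 1; }

// Runs produce in one stream and consume in another, ordered by an event.
int run_dependency_demo()
{
    int *d_data = nullptr, *d_out = nullptr;
    cudaMalloc(&d_data, sizeof(int));
    cudaMalloc(&d_out, sizeof(int));

    cudaStream_t producer, consumer;
    cudaStreamCreateWithFlags(&producer, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&consumer, cudaStreamNonBlocking);

    cudaEvent_t ready;
    cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);

    // The event completes once produce has finished in the producer stream.
    produce<<<1, 1, 0, producer>>>(d_data);
    cudaEventRecord(ready, producer);

    // Returns immediately on the host; only work enqueued in `consumer`
    // after this call is held back until `ready` has completed.
    cudaStreamWaitEvent(consumer, ready, 0);
    consume<<<1, 1, 0, consumer>>>(d_data, d_out);

    // Wait on the host before reading the result back.
    cudaStreamSynchronize(consumer);
    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);

    cudaEventDestroy(ready);
    cudaStreamDestroy(producer);
    cudaStreamDestroy(consumer);
    cudaFree(d_data);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    printf("out = %d\n", run_dependency_demo());  // prints "out = 43"
    return 0;
}
```

Without the call to `cudaStreamWaitEvent`, nothing would order `consume` after `produce`, since the two kernels sit in different streams.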

