Today I’d like to point out a very common error found in CUDA code that assumes that calling a kernel on stream 0 (either explicitly or implicitly) will result in a synchronous kernel call. Prior to the introduction of streams in CUDA, users did not have to care about synchronization issues, as everything would typically [...]
Declaring dependencies with cudaStreamWaitEvent
cudaStreamWaitEvent is a very useful synchronization primitive that takes a stream and an event as input. Even if this is not clear from its name, it is a non-blocking function: all operations enqueued in the stream after calling cudaStreamWaitEvent will only be unlocked when the event is triggered. A simple example For example, in [...]
Accessing pinned host memory directly from the device
When it comes to optimizing data transfers, ensuring that we use pinned memory is critical in CUDA. One way to use such pinned memory is to ask CUDA to allocate host memory with the cudaMallocHost function. With the UVA (Unified Virtual Addressing) mechanism added in CUDA 4.0, there is an additional behaviour that is worth [...]
Sieve of Eratosthenes using processes
This code computes all prime numbers using multiple processes. It illustrates the use of the pipe and the fork system calls. Note that this is by no means an efficient approach.
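As a rough illustration of the idea, here is a minimal sketch of a process-pipeline sieve. It is not the code from the post itself: the names `sieve_primes` and `filter_stage`, the upper bound `limit`, and the extra `results` pipe used to hand the primes back to the caller are all assumptions made for this example. Each stage treats the first number it reads as a prime, forks the next stage, and forwards only the numbers that are not multiples of its prime.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* One filter stage (hypothetical helper): the first number read from `in`
 * is prime. Report it on `results`, fork the next stage, then forward every
 * number that is not a multiple of that prime. Never returns. */
static void filter_stage(int in, int results)
{
    int prime, n, fd[2];
    pid_t pid;

    if (read(in, &prime, sizeof prime) != (ssize_t)sizeof prime) {
        close(in);
        close(results);
        _exit(0);                       /* upstream is exhausted */
    }
    if (write(results, &prime, sizeof prime) != (ssize_t)sizeof prime)
        _exit(1);
    if (pipe(fd) < 0 || (pid = fork()) < 0)
        _exit(1);
    if (pid == 0) {                     /* child: becomes the next stage */
        close(fd[1]);
        close(in);
        filter_stage(fd[0], results);
    }
    close(fd[0]);
    close(results);                     /* only the newest stage reports */
    while (read(in, &n, sizeof n) == (ssize_t)sizeof n)
        if (n % prime != 0 && write(fd[1], &n, sizeof n) != (ssize_t)sizeof n)
            _exit(1);
    close(in);
    close(fd[1]);                       /* signals end of input downstream */
    waitpid(pid, NULL, 0);
    _exit(0);
}

/* Collect the primes in [2, limit] into out[0..max_out-1].
 * Returns the number of primes found, or -1 on error. */
int sieve_primes(int limit, int *out, int max_out)
{
    int numbers[2], results[2], p, n, count = 0;
    pid_t generator, first;

    if (pipe(numbers) < 0 || pipe(results) < 0)
        return -1;
    if ((generator = fork()) < 0)
        return -1;
    if (generator == 0) {               /* generator: emits 2..limit */
        close(numbers[0]);
        close(results[0]);
        close(results[1]);
        for (n = 2; n <= limit; n++)
            if (write(numbers[1], &n, sizeof n) != (ssize_t)sizeof n)
                _exit(1);
        close(numbers[1]);
        _exit(0);
    }
    if ((first = fork()) < 0)
        return -1;
    if (first == 0) {                   /* first stage of the pipeline */
        close(numbers[1]);
        close(results[0]);
        filter_stage(numbers[0], results[1]);
    }
    close(numbers[0]);
    close(numbers[1]);
    close(results[1]);
    /* Read primes until every stage has closed its end of `results`. */
    while (read(results[0], &p, sizeof p) == (ssize_t)sizeof p) {
        if (count < max_out)
            out[count] = p;
        count++;
    }
    close(results[0]);
    waitpid(generator, NULL, 0);
    waitpid(first, NULL, 0);
    return count;
}
```

Calling `sieve_primes(30, buf, 16)` creates one process per prime plus a generator and a final empty stage, which is exactly why the approach is instructive rather than efficient. Children exit with `_exit` so that any stdio buffers inherited from the parent are not flushed twice.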
Handling exceptions in C with sigsetjmp
The following piece of code proposes an answer to the question “How do I plot a function graph while handling undefined points?”. For example, imagine that you provide a method to plot an arbitrary function, and that the user gives a function that is not properly defined (for example when computing x->1/x with x = [...]
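A minimal sketch of the general technique, under stated assumptions, might look as follows. It is not the post's own code: `sample_function`, `inverse`, and the `defined` flag array are hypothetical names, and the user function calls `raise(SIGFPE)` at x = 0 to stand in for a real trap, since a floating-point division by zero does not raise a signal by default on most platforms. The handler jumps back to a recovery point set with `sigsetjmp`, so the faulty point is skipped and sampling continues.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Recovery point used by the SIGFPE handler. */
static sigjmp_buf recover_point;

static void on_sigfpe(int signo)
{
    (void)signo;
    siglongjmp(recover_point, 1);       /* abandon the faulty evaluation */
}

/* Hypothetical user function x -> 1/x. It raises SIGFPE explicitly at
 * x == 0 to simulate a trap (an assumption made for this sketch). */
static double inverse(double x)
{
    if (x == 0.0)
        raise(SIGFPE);
    return 1.0 / x;
}

/* Evaluate f at each xs[i]. On success ys[i] holds the value and
 * defined[i] is 1; if f raised SIGFPE, defined[i] is 0 and ys[i] is left
 * untouched. Returns the number of undefined points, or -1 on error. */
int sample_function(double (*f)(double), const double *xs, double *ys,
                    unsigned char *defined, size_t n)
{
    struct sigaction sa, old;
    volatile size_t i;                  /* volatile: must survive the jump */
    volatile int skipped = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigfpe;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGFPE, &sa, &old) < 0)
        return -1;

    for (i = 0; i < n; i++) {
        /* savemask = 1: siglongjmp restores the signal mask, so SIGFPE
         * is unblocked again after leaving the handler. */
        if (sigsetjmp(recover_point, 1) == 0) {
            ys[i] = f(xs[i]);
            defined[i] = 1;
        } else {
            defined[i] = 0;
            skipped++;
        }
    }
    sigaction(SIGFPE, &old, NULL);      /* restore the previous handler */
    return skipped;
}
```

A plotting routine could then draw only the points whose `defined` flag is set. Using `sigsetjmp` rather than plain `setjmp` matters here because the signal is blocked while its handler runs, and the saved mask has to be restored for the next faulty point to be caught.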