The post Computing pi by rolling (many many) dices appeared first on Cedric Augonnet.

This technique belongs to a class of algorithms called "Monte Carlo methods", which obtain a numerical result by repeating the same experiment many times with random inputs. Monte Carlo methods are widely used (in much more serious ways), but today we are only going to discuss a toy example.

To compute pi, the experiment we are going to repeat consists in shooting a random point (x,y) within the unit square (such that x and y both belong to [0,1]). The next Figure depicts this unit square as well as the unit circle (centered at (0,0) with a radius of 1):

Assuming that the points are uniformly distributed, the probability that a point within the unit square is also located in the unit circle is the ratio between the area of the quarter circle (pi/4) and that of the square (1), that is pi/4.

```
cnt = 0;
for (i = 0 ; i < N ; i++)
{
    // Shoot a point (x,y) within the unit square
    x = rand([0; 1]);
    y = rand([0; 1]);
    // Test if the point is within the circle.
    // This test has a probability of pi/4 to succeed.
    if (x^2 + y^2 < 1)
    {
        cnt = cnt + 1;
    }
}
return (4 * cnt)/N
```

The previous pseudo-code illustrates how to implement this method in practice. It returns the estimate of pi obtained after N iterations. While this technique is easy to implement, it converges very slowly: the statistical error only decreases as 1/sqrt(N), so each additional correct digit requires roughly 100 times more iterations. It is therefore not really useful in practice to compute many digits of pi. Here are some examples of output that illustrate this slow convergence rate:

10^3 iterations: PI = 3.108 (error of 3.35e-02)

10^6 iterations: PI = 3.141924 (error of 3.31e-04)

10^9 iterations: PI = 3.141608 (error of 1.57e-05)

However, this method can be useful to numerically integrate functions, possibly over multiple dimensions, as illustrated by this Figure.

Besides their slow convergence rate, the outcomes of Monte Carlo experiments heavily rely on the quality of the random number generators. When parallelizing such algorithms, parallel random number generation may actually be the hardest point to get right.

To conclude, this is a perfect candidate if you are looking for a synthetic benchmark that scales massively, or for a way to test the quality of a random number generator, but it does not have much use in real life besides being a rather amusing way to compute pi...


The post Accessing pinned host memory directly from the device appeared first on Cedric Augonnet.

With the UVA (Unified Virtual Addressing) mechanism added in CUDA 4.0, there is an additional behaviour that is worth mentioning: pinned memory allocated with cudaMallocHost is not only a valid host buffer, it is also mapped into the devices' address space. This means that CUDA kernels can directly read or write it through the PCI bus using the host address.

For example, the following piece of code shows a kernel that increments a variable mapped in host memory. Note that we use atomic add operations: the latency to access the variable in main memory is so high that many CUDA threads are likely to read the same value concurrently, so plain increments would be lost.

Of course, this trivial example is by no means efficient code. It simply illustrates that we need not always issue a costly pair of cudaMemcpy(Async) operations from the host when accessing very few elements, or when the data access pattern is very irregular.

```
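#include <stdio.h>
#include <assert.h>

#define N 4096

/* Every thread increments the counter, which lives in pinned host memory */
static __global__ void inc_kernel(unsigned long long *cnt)
{
    atomicAdd(cnt, 1ULL);
}

int main(int argc, char **argv)
{
    unsigned long long *cnt;
    cudaError_t err;
    int i;

    /* Pinned host buffer: with UVA, the same pointer is valid on the device */
    err = cudaMallocHost((void **)&cnt, sizeof(unsigned long long));
    assert(err == cudaSuccess);
    *cnt = 0;

    /* N launches of 4 blocks of 4 threads: 16*N increments in total */
    for (i = 0; i < N; i++)
        inc_kernel<<<4, 4>>>(cnt);

    /* Wait for all kernels to complete before reading the counter
     * (cudaDeviceSynchronize replaces the deprecated cudaThreadSynchronize) */
    err = cudaDeviceSynchronize();
    assert(err == cudaSuccess);

    assert(*cnt == 16*N);
    fprintf(stderr, "CNT %llu\n", *cnt);

    cudaFreeHost(cnt);
    return 0;
}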
```

NB: the previous code must be compiled with the -arch=sm_20 flag (or higher) to have the atomicAdd function defined.


The post Erastothene sieve using processes appeared first on Cedric Augonnet.

```
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Each filter process prints the first number it receives (a prime), then
 * forwards every number that is not a multiple of it to the next filter.
 * The list of numbers is terminated by a -1. */
int filter(int fd_in)
{
    int pipe_fd[2] = { -1, -1 };
    int smallest = -1;
    int first = 1;
    int n;
    pid_t pid = 0;

    /* print the first valid number ... */
    if (read(fd_in, &n, sizeof(int)) != sizeof(int) || n < 0) {
        fprintf(stderr, "Error: no valid number in that list\n");
        return -1;
    }
    printf("%d\n", n);
    /* flush stdout before forking: fork() duplicates the unflushed stdio
     * buffer, so the child would otherwise print it again */
    fflush(stdout);
    smallest = n;

    while (read(fd_in, &n, sizeof(int)) == sizeof(int) && (n != -1)) {
        /* transmit all other non multiple numbers */
        if (n % smallest != 0) {
            /* start a new filter if this is the first prime */
            if (first) {
                first = 0;
                if (pipe(pipe_fd) < 0) {
                    perror("Pipe failed");
                    exit(-1);
                }
                pid = fork();
                if (pid < 0) {
                    perror("Fork failed ... halting !");
                    exit(-1);
                }
                if (pid == 0) {
                    /* the child does not need those fd anymore:
                     * close them so as not to get an EMFILE */
                    close(fd_in);
                    close(pipe_fd[1]);
                    /* child will execute a new filter */
                    filter(pipe_fd[0]);
                    /* should not be reached ! */
                    printf("Error !\n");
                    exit(-1);
                }
                /* the parent only writes to the new pipe */
                close(pipe_fd[0]);
            }
            if (write(pipe_fd[1], &n, sizeof(int)) < 0) {
                perror("Write failed");
            }
        }
    }

    /* the list must end with a -1: there is no more possible
     * prime to be given to the child */
    if (!first) {
        n = -1;
        write(pipe_fd[1], &n, sizeof(int));
        close(pipe_fd[1]);
        /* wait for the created child to end */
        waitpid(pid, NULL, 0);
    }
    close(fd_in);
    /* that process will die now ... */
    exit(0);
}

int main(int argc, char *argv[])
{
    int n = 100;
    int i;
    int pipe_fd[2];
    pid_t pid;

    /* if there is an argument, then this is the number n */
    if (argc == 2) {
        n = atoi(argv[1]);
    }
    if (pipe(pipe_fd) < 0) {
        perror("Pipe failed");
        exit(-1);
    }
    pid = fork();
    if (pid < 0) {
        perror("Fork failed ... halting !");
        exit(-1);
    }
    if (!pid) {
        /* the first filter only reads from the pipe */
        close(pipe_fd[1]);
        filter(pipe_fd[0]);
        /* only reached if the list was empty: do not fall through
         * into the parent's code */
        printf("Error !\n");
        exit(-1);
    }
    close(pipe_fd[0]);
    /* feed the candidates 2 .. n-1 to the first filter, then the final -1 */
    for (i = 2; i < n; i++) {
        write(pipe_fd[1], &i, sizeof(int));
    }
    i = -1;
    write(pipe_fd[1], &i, sizeof(int));
    close(pipe_fd[1]);
    waitpid(pid, NULL, 0);
    return 0;
}
```


The post Handling exceptions in C with sigsetjmp appeared first on Cedric Augonnet.

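```
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

sigjmp_buf jmpbuf;
struct sigaction siga;

/* Jump back to the last sigsetjmp() instead of returning to the
 * faulty instruction (a sigjmp_buf must be used with siglongjmp) */
void handler(int sig)
{
    (void)sig;
    siglongjmp(jmpbuf, 1);
}

/* Divides by zero whenever x >= 0: on platforms where integer division
 * by zero traps (e.g. x86), this raises SIGFPE */
int f(int x)
{
    return (1/((x>0)?0:x));
}

void trace(int (*func)(int), int x)
{
    float y = func(x);
    printf("%f\n", y);
}

int main(void)
{
    int x;

    siga.sa_handler = handler;
    sigemptyset(&siga.sa_mask);
    sigaction(SIGFPE, &siga, NULL);

    for (x = -5; x < 5; x++)
    {
        /* sigsetjmp returns 0 when called directly and 1 when we come
         * back from the handler; as the signal mask is saved (second
         * argument), SIGFPE is unblocked again after the jump */
        if (sigsetjmp(jmpbuf, 1) == 0) {
            trace(f, x);
        }
        else {
            printf("SIGFPE !\n");
        }
    }
    return 0;
}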
```

