Parallel, but slower

Question

I'm using monte carlo method to calculate pi and do a basic experience with parallel programming and openmp

the problem is that when i use 1 thread, x iterations, always runs faster than n thread, x iterations. Can anyone tell me why?

For example the code runs like this "a.out 1 1000000", where 1 is threads and 1000000 the iterations

include <omp.h>
include <stdio.h>
include <stdlib.h>
include <iostream>
include <iomanip>
include <math.h>

using namespace std;

int main (int argc, char *argv[]) {

double arrow_area_circle, pi;
float xp, yp;
int i, n;
double pitg= atan(1.0)*4.0; //for pi error
cout << "Number processors: " << omp_get_num_procs() << endl;

//Number of divisions
iterarions=atoi(argv[2]); 
arrow_area_circle = 0.0;

#pragma omp parallel num_threads(atoi(argv[1]))
{
srandom(omp_get_thread_num());

#pragma omp for private(xp, yp) reduction(+:arrow_area_circle) //*,/,-,+
for (i = 0; i < iterarions; i++) {
    xp=rand()/(float)RAND_MAX;
    yp=rand()/(float)RAND_MAX;

    if(pow(xp,2.0)+pow(yp,2.0)<=1) arrow_area_circle++;
}
}

pi = 4*arrow_area_circle / iterarions;
cout << setprecision(18) << "PI = " << pi << endl << endl;
cout << setprecision(18) << "Erro = " << pitg-pi << endl << endl;

return 0;
}

Michael · Accepted Answer

A CPU intensive task like this will be slower if you do the work in more threads than there are CPU's in the system. If you are running it on a single CPU system, you will definitely see a slowdown with more than one thread. This is due to the OS having to switch between the various threads - this is pure overhead. You should ideally have the same number of threads as cores for a task like this.

Another issue is that arrow_area_circle is shared between threads. If you have a thread running on each core, incrementing arrow_area_circle will invalidate the copy in the caches of the other cores, causing them to have to refetch. arrow_area_circle++ which should take a couple cycles will take dozens or hundreds of cycles. Try creating an arrow_area_circle per thread and combining them at the end.

EDIT: Joe Duffy just posted a blog entry on the cost of sharing data between threads.

Paul Hsieh · Answer

It looks like you are using some kind of auto-parallelizing compiler. I am going to assume you have more than 1 core/CPU in your system (as that would be too obvious -- and no hyperthreading on a Pentium 4 doesn't count as having two cores, regardless of what Intel's marketing would have you believe.) There are two problems that I see. The first is trivial and probably not your problem:

If the variable arrow_area_circle is shared between your processes, then the act of executing arrow_area_circle++ will cause an interlocking instruction to be used to synchronize the value in a way that is atomically sound. You should increment a "private" variable, then add that value just once at the end to the common arrow_area_circle variable instead of incrementing arrow_area_circle in your inner loop.
The rand() function, to operate soundly, must internally execute with a critical section. The reason is that its internal state/seed is a static shared variable; if it were not, it would be possible for two different processes to get the same output from rand() with unusually high probability, just because they were calling rand() at nearly the same time. That means rand() runs slowly, and especially so as more threads/processes are calling it at the same time. Unlike the arrow_area_circle variable (which just needs an atomic increment), a true critical section has to be invoked by rand() because its state update is more complicated. To work around this, you should obtain the source code for your own random number generator and use it with a private seed or state. The source code for the standard rand() implementation in most compilers is widely available.

I'd also like to point out that you are using the pow(,) function to perform the same thing as x * x. The later is about 300 times faster than the former. Though this point is irrelevant to the question you are asking. :)

Marius Burz · Answer

Context switching.

Parallel, but slower

Tags:

multithreading

parallel-processing

pi

openmp

montecarlo

blueomega

3 Answers

Michael

Paul Hsieh

Marius Burz

Recent Activity

Donate For Us

Parallel, but slower

Tags:

multithreading

parallel-processing

pi

openmp

montecarlo

blueomega

3 Answers

Michael

Paul Hsieh

Marius Burz

Related questions

Recent Activity

Donate For Us