I am in the process of learning how to use OpenMP in C, and as a HelloWorld exercise I am writing a program to count primes. I then parallelise this as follows:
int numprimes = 0;
#pragma omp parallel for reduction(+:numprimes)
for (i = 1; i <= n; i++)
{
    if (is_prime(i) == true)
        numprimes++;
}
I compile this code using gcc -g -Wall -fopenmp -o primes primes.c -lm (-lm for the math.h functions I am using). Then I run this code on an Intel® Core™2 Duo CPU E8400 @ 3.00GHz × 2, and as expected, the performance is better than for a serial program.

The problem, however, comes when I try to run this on a much more powerful machine. (I have also tried to manually set the number of threads to use with num_threads, but this did not change anything.) Counting all the primes up to 10 000 000 gives me the following times (using time):
8-core machine:
real 0m8.230s
user 0m50.425s
sys 0m0.004s
dual-core machine:
real 0m10.846s
user 0m17.233s
sys 0m0.004s
And this pattern continues when counting more primes: the machine with more cores shows a slight performance increase, but not as much as I would expect given how many more cores are available. (I would expect four times as many cores to imply almost four times less running time?)

Counting primes up to 50 000 000:
8-core machine:
real 1m29.056s
user 8m11.695s
sys 0m0.017s
dual-core machine:
real 1m51.119s
user 2m50.519s
sys 0m0.060s
If anyone can clarify this for me, it would be much appreciated.
EDIT
This is my prime-checking function.
static int is_prime(int n)
{
    /* handle special cases */
    if (n == 0) return 0;
    else if (n == 1) return 0;
    else if (n == 2) return 1;

    int i;
    for (i = 2; i <= (int)(sqrt((double) n)); i++)
        if (n % i == 0) return 0;

    return 1;
}
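For reference, a minimal self-contained sketch of how these pieces might be assembled into a compilable program is shown below; the main routine, the hard-coded limit, and the timing with omp_get_wtime are illustrative assumptions rather than the actual code.

/* primes.c -- illustrative sketch; build with: gcc -g -Wall -fopenmp -o primes primes.c -lm */
#include <stdio.h>
#include <stdbool.h>
#include <math.h>
#include <omp.h>

/* prime test as shown above */
static int is_prime(int n)
{
    if (n == 0) return 0;
    else if (n == 1) return 0;
    else if (n == 2) return 1;

    int i;
    for (i = 2; i <= (int)(sqrt((double) n)); i++)
        if (n % i == 0) return 0;

    return 1;
}

int main(void)
{
    const int n = 10000000;   /* assumed limit: primes up to 10 000 000 */
    int i;
    int numprimes = 0;

    double start = omp_get_wtime();

    #pragma omp parallel for reduction(+:numprimes)
    for (i = 1; i <= n; i++)
    {
        if (is_prime(i) == true)
            numprimes++;
    }

    printf("%d primes up to %d in %.3f s\n",
           numprimes, n, omp_get_wtime() - start);
    return 0;
}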
This performance is happening because:

- is_prime(i) takes longer the higher i gets, and
- parallel for constructs without a schedule clause use the default scheduling, which is typically static, i.e. it chops the for loop into equal-sized contiguous chunks.

In other words, the highest-numbered thread is doing all of the hardest operations.

Explicitly selecting a more appropriate scheduling type with the schedule clause allows you to divide the work among the threads fairly.
This version will divide the work better:
int numprimes = 0;
#pragma omp parallel for schedule(dynamic, 1) reduction(+:numprimes)
for (i = 1; i <= n; i++)
{
    if (is_prime(i) == true)
        numprimes++;
}
Information on scheduling syntax is available via MSDN and Wikipedia.
schedule(dynamic, 1) may not be optimal, as High Performance Mark notes in his answer. There is a more in-depth discussion of scheduling granularity in this OpenMP whitepaper.
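If the per-chunk dispatch overhead of schedule(dynamic, 1) turns out to matter, a larger dynamic chunk or guided scheduling are the usual things to try. The sketch below is only an illustration; the chunk size of 1000 is an arbitrary value chosen for the example, not a recommended setting.

/* Illustrative alternative: dynamic scheduling with a larger chunk.
 * The chunk size 1000 is an arbitrary example value and should be tuned.
 * schedule(guided) is another option: it starts with large chunks and
 * shrinks them as the loop progresses. */
#pragma omp parallel for schedule(dynamic, 1000) reduction(+:numprimes)
for (i = 1; i <= n; i++)
{
    if (is_prime(i) == true)
        numprimes++;
}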
Thanks also to Jens Gustedt and Mahmoud Fayez for contributing to this answer.
The reason for the apparently poor scaling of your program is, as @naroom has suggested, the variability in the run time of each call to your is_prime function. The run time does not simply increase with the value of i. Your code shows that the test terminates as soon as the first factor of i is found, so the longest run times will be for numbers with few (and large) factors, including the prime numbers themselves.
As you've already been told, the default schedule for your parallelisation will parcel out the iterations of the master loop a chunk at a time to the available threads. For your case of 5*10^7 integers to test and 8 cores to use, the first thread will get the integers 1..6250000 to test, the second will get 6250001..12500000, and so on. This will lead to a severely unbalanced load across the threads because, of course, the prime numbers are not uniformly distributed.
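If you want to see this split for yourself, a small sketch along the following lines (entirely my own illustration, with an arbitrarily small n so the output stays readable) records the iteration range each thread ends up handling under the default schedule:

/* Sketch: record the first and last iteration each thread receives
 * under the default (typically static) schedule. Assumes at most 64
 * threads; n is kept small so the output is easy to read. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int n = 80;
    int lo[64], hi[64];
    for (int t = 0; t < 64; t++) { lo[t] = 0; hi[t] = 0; }

    #pragma omp parallel for
    for (int i = 1; i <= n; i++)
    {
        int t = omp_get_thread_num();   /* each thread writes only its own slot */
        if (lo[t] == 0) lo[t] = i;
        hi[t] = i;
    }

    for (int t = 0; t < omp_get_max_threads(); t++)
        printf("thread %d handled iterations %d..%d\n", t, lo[t], hi[t]);
    return 0;
}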
Rather than using the default scheduling, you should experiment with dynamic scheduling. The following statement tells the run-time to parcel out the iterations of your master loop m iterations at a time to the threads in your computation:
#pragma omp parallel for schedule(dynamic,m)
Once a thread has finished its m iterations it will be given m more to work on. The trick for you is to find the sweet spot for m. Too small, and your computation will be dominated by the work that the run time does in parcelling out iterations; too large, and your computation will revert to the unbalanced loads that you have already seen.
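One hedged way to hunt for that sweet spot is simply to sweep over a few candidate values of m and time each run; the helper name count_primes and the candidate chunk sizes below are my own illustration, not part of the answer.

/* Sketch: time the prime count for several chunk sizes m.
 * is_prime is the function from the question; the candidate values
 * in chunks[] are arbitrary and should be adjusted by experiment. */
#include <stdio.h>
#include <stdbool.h>
#include <math.h>
#include <omp.h>

static int is_prime(int n)
{
    if (n == 0 || n == 1) return 0;
    else if (n == 2) return 1;
    for (int i = 2; i <= (int)(sqrt((double) n)); i++)
        if (n % i == 0) return 0;
    return 1;
}

static int count_primes(int n, int chunk)
{
    int numprimes = 0;
    int i;
    /* chunk may be a run-time value; OpenMP allows an integer expression here */
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+:numprimes)
    for (i = 1; i <= n; i++)
    {
        if (is_prime(i) == true)
            numprimes++;
    }
    return numprimes;
}

int main(void)
{
    const int n = 10000000;
    const int chunks[] = { 1, 16, 256, 4096, 65536 };

    for (int c = 0; c < 5; c++)
    {
        double start = omp_get_wtime();
        int numprimes = count_primes(n, chunks[c]);
        printf("m = %6d: %d primes, %.3f s\n",
               chunks[c], numprimes, omp_get_wtime() - start);
    }
    return 0;
}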
Take heart though, you will learn some useful lessons about the costs, and benefits, of parallel computation by working through all of this.