OpenMP Parallel for-loop showing little performance increase

I am in the process of learning how to use OpenMP in C, and as a HelloWorld exercise I am writing a program to count primes. I then parallelise this as follows:

int i, numprimes = 0;
#pragma omp parallel for reduction(+:numprimes)
for (i = 1; i <= n; i++)
{
    if (is_prime(i))
        numprimes++;
}

I compile this code using gcc -g -Wall -fopenmp -o primes primes.c -lm (-lm for the math.h functions I am using). Then I run this code on an Intel® Core™2 Duo CPU E8400 @ 3.00GHz × 2, and as expected, the performance is better than for a serial program.

The problem, however, comes when I try to run this on a much more powerful machine. (I have also tried to set the number of threads manually with the num_threads clause, but this did not change anything; see the sketch after the timings below.) Counting all the primes up to 10 000 000 gives me the following times (using time):

8-core machine:

real    0m8.230s
user    0m50.425s
sys     0m0.004s

dual-core machine:

real    0m10.846s
user    0m17.233s
sys     0m0.004s

And this pattern continues as I count more primes: the machine with more cores shows a slight performance increase, but not as much as I would expect. (Shouldn't four times as many cores imply close to a quarter of the running time?)

Counting primes up to 50 000 000:

8-core machine:

real    1m29.056s
user    8m11.695s
sys     0m0.017s

dual-core machine:

real    1m51.119s
user    2m50.519s
sys     0m0.060s
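For reference, here is a minimal sketch of the three equivalent ways I know of to pin the thread count (the hard-coded 8 is only an example; OMP_NUM_THREADS, omp_set_num_threads and the num_threads clause are all standard OpenMP):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* equivalent ways to pin the thread count:
       1. export OMP_NUM_THREADS=8  (environment, set before running)
       2. omp_set_num_threads(8)    (library call, before the region)
       3. num_threads(8)            (clause on the construct itself) */
    omp_set_num_threads(8);

    #pragma omp parallel num_threads(8)
    {
        #pragma omp single
        printf("running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}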

If anyone can clarify this for me, it would be much appreciated.

EDIT

This is my prime-checking function.

static int is_prime(int n)
{
  /* handle special cases */
  if      (n == 0) return 0;
  else if (n == 1) return 0;
  else if (n == 2) return 1;

  int i;
  for (i = 2; i <= (int) sqrt((double) n); i++)
    if (n % i == 0) return 0;

  return 1;
}
asked Mar 17 '13 by hjweide

2 Answers

This performance pattern is happening because:

  1. is_prime(i) takes longer the higher i gets, and
  2. Your OpenMP implementation uses static scheduling by default for parallel for constructs without a schedule clause, i.e., it chops the for loop into equal-sized contiguous chunks.

In other words, the highest-numbered thread is doing all of the hardest operations.

Explicitly selecting a more appropriate scheduling type with the schedule clause allows you to divide work among the threads fairly.

This version will divide the work better:

int numprimes = 0;
#pragma omp parallel for schedule(dynamic, 1) reduction(+:numprimes)
for (i = 1; i <= n; i++)
{
    if (is_prime(i))
        numprimes++;
}

Information on scheduling syntax is available via MSDN and Wikipedia.

schedule(dynamic, 1) may not be optimal, as High Performance Mark notes in his answer. There is a more in-depth discussion of scheduling granularity in this OpenMP whitepaper.
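If you want to experiment with the granularity without recompiling, one option is schedule(runtime), which reads the schedule from the OMP_SCHEDULE environment variable. A sketch, assuming the same declarations as in the question:

/* try e.g.  OMP_SCHEDULE="dynamic,1000" ./primes
   or        OMP_SCHEDULE="guided"       ./primes */
int numprimes = 0;
#pragma omp parallel for schedule(runtime) reduction(+:numprimes)
for (i = 1; i <= n; i++)
{
    if (is_prime(i))
        numprimes++;
}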

Thanks also to Jens Gustedt and Mahmoud Fayez for contributing to this answer.

answered by the community (14 revs, 2 users)

The reason for the apparently poor scaling of your program is, as @naroom has suggested, the variability in the run time of each call to your is_prime function. The run time does not simply increase with the value of i. Your code shows that the test terminates as soon as the first factor of i is found, so the longest run times will be for numbers with few (and large) factors, including the prime numbers themselves.

As you've already been told, the default schedule for your parallelisation will parcel out the iterations of the master loop a chunk at a time to the available threads. For your case of 5*10^7 integers to test and 8 cores to use, the first thread will get the integers 1..6250000 to test, the second will get 6250001..12500000 and so on. This will lead to a severely unbalanced load across the threads because, of course, the prime numbers are not uniformly distributed.
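You can make this partitioning visible with a toy loop; a sketch (16 iterations, 4 threads; the output order is interleaved, but each thread is handed one contiguous quarter of the range):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int i;
    /* schedule(static) with no chunk size hands each thread one
       contiguous block: 1..4, 5..8, 9..12 and 13..16 here */
    #pragma omp parallel for schedule(static) num_threads(4)
    for (i = 1; i <= 16; i++)
        printf("iteration %2d -> thread %d\n", i, omp_get_thread_num());
    return 0;
}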

Rather than using the default scheduling you should experiment with dynamic scheduling. The following statement tells the run-time to parcel out the iterations of your master loop m iterations at a time to the threads in your computation:

#pragma omp parallel for schedule(dynamic,m)

Once a thread has finished its m iterations it will be given m more to work on. The trick for you is to find the sweet spot for m. Too small and your computation will be dominated by the work that the run time does in parcelling out iterations, too large and your computation will revert to the unbalanced loads that you have already seen.
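A sketch of how you might hunt for that sweet spot: take m from the command line and time the loop with omp_get_wtime (this assumes the is_prime function from your question is defined above; primes_sched is just a hypothetical name):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

/* usage sketch:  ./primes_sched 50000000 1000   (n, then the chunk size m) */
int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 10000000;
    int m = (argc > 2) ? atoi(argv[2]) : 1;
    int i, numprimes = 0;

    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(dynamic, m) reduction(+:numprimes)
    for (i = 1; i <= n; i++)
        if (is_prime(i))
            numprimes++;
    double t1 = omp_get_wtime();

    printf("%d primes up to %d: %.3f s with chunk size %d\n",
           numprimes, n, t1 - t0, m);
    return 0;
}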

Take heart, though: you will learn some useful lessons about the costs and benefits of parallel computation by working through all of this.

answered by High Performance Mark