I am trying to write a simple application using OpenMP. Unfortunately, I have a problem with speedup.
The application has one while loop. Its body consists of some instructions that must run sequentially, followed by one for loop. I use #pragma omp parallel for to make this for loop parallel. The for loop doesn't do much work, but it is executed very often.
I prepared two versions of the for loop and ran the application on 1, 2 and 4 cores:
version 1 (4 iterations in the for loop): 22 sec, 23 sec, 26 sec.
version 2 (100000 iterations in the for loop): 20 sec, 10 sec, 6 sec.
As you can see, when the for loop doesn't have much work, the run time on 2 and 4 cores is higher than on 1 core.
I guess the reason is that #pragma omp parallel for creates new threads in each iteration of the while loop. So I would like to ask: is it possible to create the threads once (before the while loop) and still ensure that part of the work in the while loop is done sequentially?
#include <omp.h>
#include <iostream>
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char* argv[])
{
    double sum = 0;
    while (true)
    {
        // ...
        // some work which should be done sequentially
        // ...
        #pragma omp parallel for num_threads(atoi(argv[1])) reduction(+:sum)
        for(int j=0; j<4; ++j) // version 2: for(int j=0; j<100000; ++j)
        {
            double x = pow(j, 3.0);
            x = sqrt(x);
            x = sin(x);
            x = cos(x);
            x = tan(x);
            sum += x;
            double y = pow(j, 3.0);
            y = sqrt(y);
            y = sin(y);
            y = cos(y);
            y = tan(y);
            sum += y;
            double z = pow(j, 3.0);
            z = sqrt(z);
            z = sin(z);
            z = cos(z);
            z = tan(z);
            sum += z;
        }
        if (sum > 100000000)
        {
            break;
        }
    }
    return 0;
}
Most OpenMP implementations create a number of threads on program startup and keep them for the duration of the program. That is, most implementations don't dynamically create and destroy threads during execution; to do so would hit performance with severe thread management costs. This approach to thread management is consistent with, and appropriate for, the usual use cases for OpenMP.
It is far more likely that the slowdown you see when you increase the number of OpenMP threads is down to imposing a parallel overhead on a loop with a tiny number of iterations. Hristo's answer covers this.
You could move the parallel region outside of the while (true) loop and use the single directive to make the serial part of the code execute in one thread only. This removes the repeated fork/join overhead, although the implicit barriers remain. Also, OpenMP is not really useful on tight loops with a very small number of iterations (like your version 1). You are basically measuring the OpenMP overhead, since the work inside the loop is done really fast: even 100000 iterations with transcendental functions take less than a second on a current-generation CPU (at 2 GHz and roughly 100 cycles per FP instruction other than addition, the 15 function calls per iteration add up to about 1.5 × 10^8 cycles, which is on the order of 100 ms).
That's why OpenMP provides the if(condition) clause, which can be used to selectively turn off parallelisation for small loops:
#pragma omp parallel for ... if(loopcnt > 10000)
for (i = 0; i < loopcnt; i++)
    ...
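As a fuller illustration of the if clause, here is a minimal sketch that applies it to one of the three identical chains from the question's loop body. The function name sum_terms is hypothetical, and the threshold of 10000 is simply the value from the snippet above, not a tuned number. When the condition is false, the region runs on a team of one thread, so small loops skip most of the parallel overhead.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not from the question): sums one pow/sqrt/sin/cos/tan
// chain per iteration. The 10000 threshold is an assumption; measure on your
// own machine to pick a sensible cut-off.
double sum_terms(int loopcnt)
{
    assert(loopcnt >= 0);
    double sum = 0.0;

    // If the condition is false, the loop is executed by the calling thread
    // alone; otherwise it is split across the team as usual.
    #pragma omp parallel for reduction(+:sum) if(loopcnt > 10000)
    for (int j = 0; j < loopcnt; ++j)
    {
        double x = std::pow(j, 3.0);
        x = std::sqrt(x);
        x = std::sin(x);
        x = std::cos(x);
        x = std::tan(x);
        sum += x;
    }
    return sum;
}
```

Calling sum_terms(4) stays serial while sum_terms(100000) is parallelised. With GCC or Clang the pragma takes effect only when compiling with -fopenmp; without it the pragma is ignored and the function still returns the same kind of result serially.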
It is also advisable to use schedule(static) for regular loops (that is, for loops in which every iteration takes about the same time to compute).
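The earlier suggestion of moving the parallel region outside the while loop can be sketched roughly as follows. This is only an illustration under assumptions: the names run_hoisted, term, inner_iters, limit and passes are hypothetical, the counter stands in for the real sequential work, and only one of the question's three identical chains is summed per iteration.

```cpp
#include <cassert>
#include <cmath>

// One pow/sqrt/sin/cos/tan chain from the question's loop body.
static double term(int j)
{
    return std::tan(std::cos(std::sin(std::sqrt(std::pow(j, 3.0)))));
}

// Hypothetical helper: the thread team is created once and reused for every
// pass of the while loop. `passes` reports how many outer passes were made.
double run_hoisted(int inner_iters, double limit, long &passes)
{
    assert(inner_iters > 0 && limit > 0.0);
    double sum = 0.0;   // shared by the whole team
    long count = 0;     // shared; only touched inside the single construct

    #pragma omp parallel
    {
        while (true)
        {
            // Sequential part: exactly one thread runs it, the others wait
            // at the implicit barrier that ends the single construct.
            #pragma omp single
            {
                ++count;   // placeholder for the real sequential work
            }

            // Work-sharing loop: reuses the existing team, no new fork/join.
            #pragma omp for reduction(+:sum) schedule(static)
            for (int j = 0; j < inner_iters; ++j)
            {
                sum += term(j);
            }

            // The loop's implicit barrier has completed the reduction, so
            // every thread sees the same sum and takes the same branch.
            if (sum > limit)
            {
                break;
            }
        }
    }
    passes = count;
    return sum;
}
```

For example, `long n; double s = run_hoisted(4, 100.0, n);` runs the small version 1 style loop until the sum exceeds 100. Note that each pass still costs two barriers, so for a 4-iteration loop this is unlikely to beat plain serial code; it only avoids re-entering a parallel region on every pass.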