I'm trying to parallelize a very simple for-loop, but this is my first attempt at using openMP in a long time. I'm getting baffled by the run times. Here is my code:
#include <vector> #include <algorithm> using namespace std; int main () { int n=400000, m=1000; double x=0,y=0; double s=0; vector< double > shifts(n,0); #pragma omp parallel for for (int j=0; j<n; j++) { double r=0.0; for (int i=0; i < m; i++){ double rand_g1 = cos(i/double(m)); double rand_g2 = sin(i/double(m)); x += rand_g1; y += rand_g2; r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2); } shifts[j] = r / m; } cout << *std::max_element( shifts.begin(), shifts.end() ) << endl; }
I compile it with
g++ -O3 testMP.cc -o testMP -I /opt/boost_1_48_0/include
that is, no "-fopenmp", and I get these timings:
real 0m18.417s user 0m18.357s sys 0m0.004s
when I do use "-fopenmp",
g++ -O3 -fopenmp testMP.cc -o testMP -I /opt/boost_1_48_0/include
I get these numbers for the times:
real 0m6.853s user 0m52.007s sys 0m0.008s
which doesn't make sense to me. How using eight cores can only result in just 3-fold increase of performance? Am I coding the loop correctly?
The #pragma omp parallel for creates a parallel region (as described before), and to the threads of that region the iterations of the loop that it encloses will be assigned, using the default chunk size , and the default schedule which is typically static .
The OpenMP clause: #pragma omp parallel. creates a parallel region with a team of threads , where each thread will execute the entire block of code that the parallel region encloses.
Work-sharing constructs can be used to divide a task among the threads so that each thread executes its allocated part of the code. Both task parallelism and data parallelism can be achieved using OpenMP in this way.
You should make use of the OpenMP reduction
clause for x
and y
:
#pragma omp parallel for reduction(+:x,y) for (int j=0; j<n; j++) { double r=0.0; for (int i=0; i < m; i++){ double rand_g1 = cos(i/double(m)); double rand_g2 = sin(i/double(m)); x += rand_g1; y += rand_g2; r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2); } shifts[j] = r / m; }
With reduction
each thread accumulates its own partial sum in x
and y
and in the end all partial values are summed together in order to obtain the final values.
Serial version: 25.05s user 0.01s system 99% cpu 25.059 total OpenMP version w/ OMP_NUM_THREADS=16: 24.76s user 0.02s system 1590% cpu 1.559 total
See - superlinear speed-up :)
let's try to understand how parallelize simple for loop using OpenMP
#pragma omp parallel #pragma omp for for(i = 1; i < 13; i++) { c[i] = a[i] + b[i]; }
assume that we have 3
available threads, this is what will happen
firstly
and finally
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With