Parallel for loop in OpenMP

Tags:

c++

performance

multithreading

parallel-processing

openmp

I'm trying to parallelize a very simple for-loop, but this is my first attempt at using OpenMP in a long time, and I'm baffled by the run times. Here is my code:

#include <cmath>      // cos, sin, sqrt
#include <iostream>   // cout
#include <vector>
#include <algorithm>  // max_element
using namespace std;
int main ()
{
    int n=400000,  m=1000;
    double x=0,y=0;
    double s=0;
    vector< double > shifts(n,0);
    #pragma omp parallel for
    for (int j=0; j<n; j++) {
        double r=0.0;
        for (int i=0; i < m; i++){
            double rand_g1 = cos(i/double(m));
            double rand_g2 = sin(i/double(m));
            x += rand_g1;
            y += rand_g2;
            r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
        }
        shifts[j] = r / m;
    }
    cout << *std::max_element( shifts.begin(), shifts.end() ) << endl;
}

I compile it with

g++ -O3 testMP.cc -o testMP  -I /opt/boost_1_48_0/include

that is, no "-fopenmp", and I get these timings:

real    0m18.417s
user    0m18.357s
sys     0m0.004s

when I do use "-fopenmp",

g++ -O3 -fopenmp testMP.cc -o testMP  -I /opt/boost_1_48_0/include

I get these numbers for the times:

real    0m6.853s
user    0m52.007s
sys     0m0.008s

which doesn't make sense to me. How can using eight cores result in only a 3-fold increase in performance? Am I coding the loop correctly?

asked Aug 02 '12 07:08

dsign

2 Answers

You should make use of the OpenMP reduction clause for x and y:

#pragma omp parallel for reduction(+:x,y)
for (int j=0; j<n; j++) {
    double r=0.0;
    for (int i=0; i < m; i++){
        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));
        x += rand_g1;
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}

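With reduction, each thread accumulates its own partial sum in a private copy of x and y, and at the end all partial values are summed to obtain the final values. Without it, x and y are shared, so all threads update the same two variables concurrently. That is a data race, and it also forces the cache line holding them to bounce between cores, which hurts scaling.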

Serial version: 25.05s user 0.01s system 99% cpu 25.059 total
OpenMP version w/ OMP_NUM_THREADS=16: 24.76s user 0.02s system 1590% cpu 1.559 total

See - superlinear speed-up :)
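To make the partial-sum idea concrete, here is a minimal sketch of roughly what the reduction clause does for you, written out by hand. The function names sum_cos_reduction and sum_cos_manual are made up for this illustration, and the real runtime is free to combine the partial sums differently (for example with a tree instead of a critical section). Both functions also compile and give the same result without -fopenmp, since the pragmas are then simply ignored.

```cpp
#include <cmath>

// Hypothetical helper: sums cos(i/m) over the n x m iteration space,
// letting OpenMP manage the per-thread partial sums via reduction.
double sum_cos_reduction(int n, int m) {
    double x = 0.0;
    #pragma omp parallel for reduction(+:x)
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            x += std::cos(i / double(m));
        }
    }
    return x;
}

// Hypothetical helper: the same sum with the partial sums written out
// by hand. Each thread adds into its own local_x and touches the
// shared x only once, at the very end.
double sum_cos_manual(int n, int m) {
    double x = 0.0;
    #pragma omp parallel
    {
        double local_x = 0.0;  // declared inside the region, so private
        #pragma omp for nowait
        for (int j = 0; j < n; j++) {
            for (int i = 0; i < m; i++) {
                local_x += std::cos(i / double(m));
            }
        }
        // Combine the partial sums one thread at a time.
        #pragma omp critical
        x += local_x;
    }
    return x;
}
```

Because the partial sums are added in a different order than in a serial loop, the result can differ from the serial one in the last few bits, so compare with a small tolerance rather than with ==.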

answered Sep 18 '22 19:09

Hristo Iliev

Let's try to understand how to parallelize a simple for loop using OpenMP:

#pragma omp parallel
#pragma omp for
for(i = 1; i < 13; i++)
{
    c[i] = a[i] + b[i];
}

Assume that we have 3 available threads; this is what will happen:

[Image: https://i.stack.imgur.com/inNOQ.jpg]

First:

Each thread is assigned an independent set of iterations.

And finally:

Threads must wait at the end of the work-sharing construct (the for construct ends with an implicit barrier).
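If you want to see the assignment for yourself, here is a small sketch that runs the same loop and records which thread handled each index. The function name add_and_record_owner and the owner vector are invented for this example. Which thread gets which iterations depends on the schedule and the implementation, so treat the recorded ids as something to inspect, not as a guaranteed pattern. The omp.h call is guarded so the code also builds without -fopenmp, in which case every iteration is recorded as thread 0.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: computes c[i] = a[i] + b[i] for i = 1 .. c.size()-1
// and returns, per index, the id of the thread that executed it
// (-1 for indices the loop never touches, i.e. index 0).
// Assumes a and b are at least as long as c.
std::vector<int> add_and_record_owner(const std::vector<double>& a,
                                      const std::vector<double>& b,
                                      std::vector<double>& c) {
    const int n = static_cast<int>(c.size());
    std::vector<int> owner(n, -1);
    #pragma omp parallel
    {
        int tid = 0;  // stays 0 when compiled without OpenMP
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        #pragma omp for
        for (int i = 1; i < n; i++) {
            c[i] = a[i] + b[i];
            owner[i] = tid;  // each index is written by exactly one thread
        }
        // Implicit barrier: no thread passes this point until all
        // iterations of the loop above are done.
    }
    return owner;
}
```

With a, b and c of size 13 this covers the same range as the loop above (i = 1 to 12). Printing owner after a run with OMP_NUM_THREADS=3 shows how your runtime split the twelve iterations.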

answered Sep 19 '22 19:09

Basheer AL-MOMANI
