Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C++ OpenMP Parallel For Loop - Alternatives to std::vector [closed]

Tags:

Based on this thread, OpenMP and STL vector, which data structures are good alternatives for a shared std::vector in a parallel for loop? The main aspect is speed, and the vector might require resizing during the loop.

like image 637
Armin Meisterhirn Avatar asked Sep 07 '13 03:09

Armin Meisterhirn


2 Answers

I think you can use std::vector with OpenMP most of the time and still have good performance. The following code for example fills std::vectors in parallel and then combines them in the end. As long as your main loop/fill function is the bottleneck this should work well in general and be thread safe.

std::vector<int> vec; #pragma omp parallel {     std::vector<int> vec_private;     #pragma omp for nowait //fill vec_private in parallel     for(int i=0; i<100; i++) {         vec_private.push_back(i);     }     #pragma omp critical     vec.insert(vec.end(), vec_private.begin(), vec_private.end()); } 

Edit:

OpenMP 4.0 allows user-defined reductions using #pragma omp declare reduction. The code above can be simplified with to

#pragma omp declare reduction (merge : std::vector<int> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))  std::vector<int> vec; #pragma omp parallel for reduction(merge: vec) for(int i=0; i<100; i++) vec.push_back(i); 

Edit: What I have shown so far does not fill the vector in order. If the order matters then this can be done like this

std::vector<int> vec; #pragma omp parallel {     std::vector<int> vec_private;     #pragma omp for nowait schedule(static)     for(int i=0; i<N; i++) {          vec_private.push_back(i);     }     #pragma omp for schedule(static) ordered     for(int i=0; i<omp_get_num_threads(); i++) {         #pragma omp ordered         vec.insert(vec.end(), vec_private.begin(), vec_private.end());     } } 

This avoids saving a std::vector for each thread and then merging them in serial outside of the parallel region. I learned about this "trick" here. I'm not sure how to do this (or if it's even possible) for user-defined reductions.. It's not possible to do this with user-defined reductions.

I just realized that the critical section is not necessary which I figured out from this question parallel-cumulative-prefix-sums-in-openmp-communicating-values-between-thread. This method also gets the order correct as well

std::vector<int> vec; size_t *prefix; #pragma omp parallel {     int ithread  = omp_get_thread_num();     int nthreads = omp_get_num_threads();     #pragma omp single     {         prefix = new size_t[nthreads+1];         prefix[0] = 0;     }     std::vector<int> vec_private;     #pragma omp for schedule(static) nowait     for(int i=0; i<100; i++) {         vec_private.push_back(i);     }     prefix[ithread+1] = vec_private.size();     #pragma omp barrier     #pragma omp single      {         for(int i=1; i<(nthreads+1); i++) prefix[i] += prefix[i-1];         vec.resize(vec.size() + prefix[nthreads]);     }     std::copy(vec_private.begin(), vec_private.end(), vec.begin() + prefix[ithread]); } delete[] prefix; 
like image 110
Z boson Avatar answered Sep 20 '22 17:09

Z boson


The question you link was talking about the fact that "that STL vector container is not thread-safe in the situation where multiple threads write to a single container" . This is only true, as stated correctly there, if you call methods that can cause reallocation of the underlying array that std::vector holds. push_back(), pop_back() and insert() are examples of these dangerous methods.

If you need thread safe reallocation, then the library intel thread building block offers you concurrent vector containers . You should not use tbb::concurrent_vector in single thread programs because the time it takes to access random elements is higher than the time std::vector takes to do the same (which is O(1)). However, concurrent vector calls push_back(), pop_back(), insert() in a thread safe way, even when reallocation happens.

EDIT 1: The slides 46 and 47 of the following Intel presentation give an illustrative example of concurrent reallocation using tbb::concurrent_vector

EDIT 2: By the way, if you start using Intel Tread Building Block (it is open source, it works with most compilers and it is much better integrated with C++/C++11 features than openmp), then you don't need to use openmp to create a parallel_for, Here is a nice example of parallel_for using tbb.

like image 43
Vivian Miranda Avatar answered Sep 22 '22 17:09

Vivian Miranda