 

How do I parallelize a for loop through a C++ std::list using OpenMP?

I would like to iterate through all elements in an std::list in parallel fashion using OpenMP. The loop should be able to alter the elements of the list. Is there a simple solution for this? It seems that OpenMP 3.0 supports parallel for loops when the iterator is a Random Access Iterator, but not otherwise. In any case, I would prefer to use OpenMP 2.0 as I don't have full control over which compilers are available to me.

If my container were a vector, I might use:

#pragma omp parallel for
for (auto it = v.begin(); it != v.end(); ++it) {
    it->process();
}

I understand that I could copy the list into a vector, do the loop, then copy everything back. However, I would like to avoid this complexity and overhead if possible.
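
For reference, a minimal sketch of that copy-based workaround (assuming the list is called l; Item and its process() method are hypothetical stand-ins for whatever the list holds; the signed loop index keeps the loop legal under OpenMP 2.0):

std::vector<Item> tmp(l.begin(), l.end()); // copy the list into a vector

#pragma omp parallel for
for (int i = 0; i < (int)tmp.size(); ++i) { // OpenMP 2.0 requires a signed index
    tmp[i].process();
}

l.assign(tmp.begin(), tmp.end()); // copy the processed elements back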

asked Jan 01 '12 by mshang

People also ask

How do I parallelize a loop in OpenMP?

The #pragma omp parallel for directive creates a parallel region (as described before), and the iterations of the loop it encloses are assigned to the threads of that region, using the default chunk size and the default schedule, which is typically static.
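
As an illustration, those typical defaults can be spelled out explicitly (a, n, and f are hypothetical names here):

// Making the usual defaults explicit: a static schedule splits the
// iterations into roughly equal contiguous chunks, one per thread.
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; ++i) {
    a[i] = f(a[i]); // each iteration is executed by exactly one thread
}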

Can you parallelize a while loop?

In addition, it is common for while loops to check a condition using the last iteration's result. This is called a read-after-write or true dependency and cannot be parallelized.
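
For example, a sketch like the following (converged and step are hypothetical functions) cannot be parallelized, because each iteration reads the value the previous one wrote:

double x = x0;
while (!converged(x)) {
    x = step(x); // iteration i needs the result of iteration i-1
}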

What is OpenMP parallel for?

OpenMP is a library for parallel programming in the SMP (symmetric multi-processors, or shared-memory processors) model. When programming with OpenMP, all threads share memory and data. OpenMP supports C, C++ and Fortran. The OpenMP functions are declared in a header file called omp.h.
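
A minimal example that includes the header and queries the runtime:

#include <omp.h>
#include <cstdio>

int main() {
    #pragma omp parallel
    printf("hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
}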


3 Answers

If you decide to use OpenMP 3.0, you can use the task feature:

#pragma omp parallel
#pragma omp single
{
  for(auto it = l.begin(); it != l.end(); ++it)
    #pragma omp task firstprivate(it)
    it->process();
  #pragma omp taskwait
}

This will execute the loop in one thread, but delegate the processing of elements to others.
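
Put together as a self-contained program, a sketch might look like this (Item and its process() method are placeholders for whatever the list actually holds):

#include <list>
#include <omp.h>

struct Item {
    int value;
    void process() { value *= 2; } // placeholder work
};

int main() {
    std::list<Item> l(100, Item{1});

    #pragma omp parallel
    #pragma omp single
    {
        for (auto it = l.begin(); it != l.end(); ++it)
            #pragma omp task firstprivate(it) // each task gets its own copy of it
            it->process();
        #pragma omp taskwait // wait for all tasks before leaving the single
    }
    return 0;
}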

Without OpenMP 3.0 the easiest way would be writing pointers to all elements of the list (or iterators) into a vector and iterating over that one. This way you wouldn't have to copy anything back and would avoid the overhead of copying the elements themselves, so it shouldn't have too much overhead:

std::vector<my_element*> elements; // my_element is whatever is in the list
for(auto it = list.begin(); it != list.end(); ++it)
  elements.push_back(&(*it));

#pragma omp parallel shared(elements)
{
  #pragma omp for
  for(int i = 0; i < (int)elements.size(); ++i) // signed index for OpenMP 2.0; newer OpenMP also allows iterators
    elements[i]->process();
}

If you want to avoid copying even the pointers, you can always create a parallelized for loop by hand. You can either have the threads access interleaved elements of the list (as proposed by KennyTM) or split the range into roughly equal contiguous parts before iterating and iterate over those. The latter seems preferable, since the threads avoid accessing list nodes currently processed by other threads (even if only the next pointer), which could lead to false sharing. This would look roughly like this:

#pragma omp parallel
{
  int thread_count = omp_get_num_threads();
  int thread_num   = omp_get_thread_num();
  size_t chunk_size = list.size() / thread_count;
  auto begin = list.begin();
  std::advance(begin, thread_num * chunk_size);
  auto end = begin;
  if(thread_num == thread_count - 1) // the last thread iterates the remaining sequence
    end = list.end();
  else
    std::advance(end, chunk_size);
  #pragma omp barrier
  for(auto it = begin; it != end; ++it)
    it->process();
}

The barrier is not strictly needed; however, if process mutates the processed element (meaning it is not a const method), there might be some sort of false sharing without it, since threads would be iterating over a sequence which is already being mutated. This way will iterate 3*n times over the sequence (where n is the number of threads), so scaling might be less than optimal for a high number of threads.

To reduce the overhead you could put the generation of the ranges outside of the #pragma omp parallel; however, you will need to know how many threads will form the parallel section. So you'd probably have to manually set num_threads, or use omp_get_max_threads() and handle the case that the number of threads created is fewer than omp_get_max_threads() (which is only an upper bound). In the latter case you could assign each thread several chunks (using #pragma omp for should do that):

int max_threads = omp_get_max_threads();
std::vector<std::pair<std::list<...>::iterator, std::list<...>::iterator> > chunks;
chunks.reserve(max_threads);

size_t chunk_size = list.size() / max_threads;
auto cur_iter = list.begin();
for(int i = 0; i < max_threads - 1; ++i)
{
   auto last_iter = cur_iter;
   std::advance(cur_iter, chunk_size);
   chunks.push_back(std::make_pair(last_iter, cur_iter));
}
chunks.push_back(std::make_pair(cur_iter, list.end()));

#pragma omp parallel shared(chunks)
{
  #pragma omp for
  for(int i = 0; i < max_threads; ++i)
    for(auto it = chunks[i].first; it != chunks[i].second; ++it)
      it->process();
}

This will take only three iterations over the list (two, if you can get the size of the list without iterating; note that since C++11, std::list::size() is required to be O(1)). I think that is about the best you can do for non-random-access iterators without using tasks or iterating over some out-of-place data structure (like a vector of pointers).

answered Sep 23 '22 by Grizzly


I doubt it is possible, since you can't just jump into the middle of a list without traversing it. Lists are not stored in contiguous memory and std::list iterators are not Random Access; they are only bidirectional.
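
For what it's worth, this can be checked at compile time with iterator_traits; a list iterator reports bidirectional_iterator_tag, while a vector iterator reports random_access_iterator_tag:

#include <iterator>
#include <list>
#include <type_traits>
#include <vector>

static_assert(std::is_same<
    std::iterator_traits<std::list<int>::iterator>::iterator_category,
    std::bidirectional_iterator_tag>::value, "list iterators are bidirectional");

static_assert(std::is_same<
    std::iterator_traits<std::vector<int>::iterator>::iterator_category,
    std::random_access_iterator_tag>::value, "vector iterators are random access");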

answered Sep 25 '22 by jmucchiello


http://openmp.org/forum/viewtopic.php?f=3&t=51

#pragma omp parallel
{
   for(it= list1.begin(); it!= list1.end(); it++)
   {
      #pragma omp single nowait
      {
         it->compute();
      }
   } // end for
} // end omp parallel

This can be understood as if it were unrolled:

{
  it = list1.begin();
  #pragma omp single nowait
  {
    it->compute();
  }
  it++;
  #pragma omp single nowait
  {
    it->compute();
  }
  it++;
...
}

Given code like this:

#include <cstdio>
#include <vector>
#include <omp.h>

int main()
{
    std::vector<int> l(4, 0);
    #pragma omp parallel for
    for (int i = 0; i < (int)l.size(); ++i) {
        printf("th %d = %d \n", omp_get_thread_num(), l[i] = i);
    }
    printf("\n");
    #pragma omp parallel
    {
        for (auto i = l.begin(); i != l.end(); ++i) {
            #pragma omp single nowait
            {
                printf("th %d = %d \n", omp_get_thread_num(), *i);
            }
        }
    }
    return 0;
}

With export OMP_NUM_THREADS=4 the output is as follows (note that in the second section, the worker thread number can repeat):

th 2 = 2 
th 1 = 1 
th 0 = 0 
th 3 = 3 

th 2 = 0 
th 1 = 1 
th 2 = 2 
th 3 = 3
answered Sep 21 '22 by Jichao