Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Vector filling across OpenMP threads

I have an algorithm for which one goal is to fill vectors. For the sake of performance, iterations of the algorithm are spread across OpenMP threads. I was wondering which way would provide the better/safer way of filling the vectors.

Note that the ordering of the vectors must be consistent (i.e. value n of vec1 must come from the same iteration than value n of vec2.)

Hypothesis 1:

std::vector<BasicType> vec1;
std::vector<BasicType> vec2;
#pragma opm parallel for
for(...)
{
    // Do some intensive stuff to compute val1 and val2
    // ...

    #pragma omp critical
    {
        vec1.push_back(val1);
        vec2.push_back(val2);
    }
}
// Then go on to work with vec1 and vec2...

Hypothesis 2:

std::vector<BasicType> vec1;
std::vector<BasicType> vec2;
#pragma opm parallel
{
    std::vector<BasicType> vec1local;
    std::vector<BasicType> vec2local;
    #pragma omp for
    for(...)
    {
        // Do some intensive stuff to compute val1 and val2
        // ...

        vec1local.push_back(val1);
        vec2local.push_back(val2);
    }

    #pragma omp critical
    {
        // See note 1 below
        vec1.insert(vec1.end(), vec1local.begin(), vec1local.end());
        vec2.insert(vec2.end(), vec2local.begin(), vec2local.end());
    }
}
// Then go on to work with vec1 and vec2...

Note 1: This was taken from Best way to append vector to vector

Note 2: It seems val1 and val2 could be combined in some object in order to maintain consistency and work only with one vector, but for now it seems inpractical for the remainder of the algorithm.

Note 3: To give an order of magnitude, the for loop contains around 100 iterations splitted among 4 threads. Apart for very few exceptions, each iteration is expected to have the same workload (which brings the issue of the critical sections happening around the same time.)

Note 4: Just for the records, the whole thing deals with image stabilisation, and runs on a Tegra K1 architecture. Compiler used is gcc 4.8.4.

like image 648
Vincent COISY Avatar asked Mar 31 '16 14:03

Vincent COISY


1 Answers

From your two suggestions, I would prefer the first. It is simpler and does not require additional memory. However, I would suggest an alternative without the critical section:

std::vector<BasicType> vec1(size);
std::vector<BasicType> vec2(size);
#pragma opm parallel for
for(...)
{
    // Do some intensive stuff to compute val1 and val2
    // ...

    vec1[i] = val1;
    vec2[i] = val2;
}

Note that this may have some performance issues due to cache line stealing. However, I wouldn't worry about that unless it turns out to be an actual problem validated by actual performance analysis. A solution could then be:

  • Go with your second solution. (which costs memory and additional post-processing)
  • Align your vector and use appropriate chunks for the loop. (such that each thread has local cache lines)
  • Use a data structure that internally contains the local vectors, but to the outside provides the necessary global vector-operations. (This is probably overall the best solution.)
like image 131
Zulan Avatar answered Sep 24 '22 12:09

Zulan