Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I use all the cores in the loop?

There is a loop.

for (int i = 0; i < n; ++i) {
    //...
    v[i] = o.f(i);
    //...
}

Each v[i] = o.f(i) is independent of all the other v[i] = o.f(i).
n can be any value and it may not be a multiple of the number of cores. What is the simplest way to use all the cores to do this?

like image 333
Ufx Avatar asked Mar 14 '18 15:03

Ufx


People also ask

Can multiple cores run the same process?

Yes, a single process can run multiple threads on different cores. Caching is specific to the hardware. Many modern Intel processors have three layers of caching, where the last level cache is shared across cores.


2 Answers

The ExecutionPolicy overloads of the algorithms in <algorithm> exist for this purpose. std::transform applies a function to each element of a source range to assign to a destination range.

v.begin() is an acceptable destination, so long as v is of appropriate size. Your snippet assumes this when it uses v[i], so I will too.

We then need an iterator that gives the values [0, n) as our source, so boost::counting_iterator<int>.

Finally we need a Callable that will apply o.f to our values, so lets capture o in a lambda.

#include <algorithm>
#include <execution>
#include <boost/iterator/counting_iterator.hpp>

// assert(v.size() >= n)
std::transform(std::execution::par, boost::counting_iterator<int>(0), boost::counting_iterator<int>(n), v.begin(), [&o](int i){ return o.f(i); });

If o.f does not perform any "vectorization-unsafe operations", you are able to use std::execution::par_unseq, which may interleave calls on the same thread (i.e. unroll the loop and use SIMD instructions)

like image 112
Caleth Avatar answered Nov 13 '22 14:11

Caleth


In the land of existing compilers, and remembering that M/S can't even get this stuff right for C++11, never mind about C++17/20, the C++11 answer goes something like:

typedef v.value_type R;
std::vector< std::future<R> > fut(n);
for (int i=0; i<n; i++)
    fut[i] = std::async(std::launch::async, O::f, o, i);
for (auto& f : fut)
    v.push_back(f.get());

@arne suggests we can do better by throttling the number of tasks by considering the number of processors (P), which is true, though the above code will give you a clear indication on whether you will really benefit from multi-threading the method f. Given we only want to launch X jobs simultaneously, where X is > P, < 3*P depending on the variation in job complexity (note I am relying on a signed index):

typedef v.value_type R;
std::vector< std::future<R> > fut(n);
for (ssize_t i=0, j=-X; j<n; i++,j++)
{
    if (i<n)    fut[i] = std::async(std::launch::async, O::f, o, i);
    if (j>=0)   v.push_back(fut[j].get());
}

I'm not claiming the above code is "great", but if the jobs are complex enough for us to need multithreading, the cost of looping a few extra times isn't gointg to be noticed. You will notice that if X > n the loop will spin a few times in the middle, but will produce the correct result :-)

like image 4
Gem Taylor Avatar answered Nov 13 '22 14:11

Gem Taylor