C++17 upgraded 69 STL algorithms to support parallelism via an optional ExecutionPolicy parameter (as the 1st argument), e.g.
std::sort(std::execution::par, begin(v), end(v));
I suspect the C++17 standard deliberately says nothing about how to implement the multi-threaded algorithms, leaving it up to the library writers to decide what is best (and allowing them to change their minds, later). Still, I'm keen to understand at a high level what issues are being considered in the implementation of the parallel STL algorithms.
Some questions on my mind include (but are not limited to!):
I realise the point of these parallel algorithms is to shield the Programmer from having to worry about these details. However, any info that gives me a high-level mental picture of what's going on inside the library calls would be appreciated.
Parallel STL is an implementation of the C++ standard library algorithms with support for execution policies, as specified in the ISO/IEC 14882:2017 standard, commonly called C++17.
In computer science, a parallel algorithm, as opposed to a traditional serial algorithm, is an algorithm that can perform multiple operations at the same time. It has been a tradition of computer science to describe serial algorithms in abstract machine models, often the one known as the random-access machine.
par means "execute in parallel", which permits the implementation to execute on multiple threads in parallel.
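To make the three policies concrete, here is a minimal sketch (standard C++17 names only; it assumes a standard library that actually ships &lt;execution&gt;, e.g. MSVC, or GCC 9+ with TBB installed):

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// Sort the same data under each of the three C++17 execution policies.
std::vector<int> sort_with_policies(std::vector<int> v) {
    // seq: same semantics as the classic single-threaded overload.
    std::sort(std::execution::seq, v.begin(), v.end());
    // par: the library may split the work across multiple threads.
    std::sort(std::execution::par, v.begin(), v.end());
    // par_unseq: additionally allows interleaving (vectorization) within
    // a thread, so element access functions must not take locks.
    std::sort(std::execution::par_unseq, v.begin(), v.end());
    return v;
}
```

The policy argument only states what the caller permits; an implementation is free to fall back to sequential execution.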
Most of these questions cannot be answered by the standard as of today. However, your question, as I understand it, mixes two concepts:
C1. Constraints on parallel algorithms
C2. Execution of algorithms
The C++17 parallel STL is all about C1: it sets constraints on how instructions and/or threads may be interleaved/transformed in a parallel computation. C2, on the other hand, is in the process of being standardized; the keyword is executor (more on this later).
For C1, there are 3 standard policies (std::execution::seq, par and par_unseq) that correspond to every combination of task and instruction parallelism. For example, when performing an integer accumulation, par_unseq could be used, since the order is not important. However, for floating-point arithmetic, where addition is not associative, a better fit would be seq to, at least, get a deterministic result. In short: policies set constraints on parallel computation, and these constraints can be exploited by a smart compiler.
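The accumulation example above can be sketched as follows (standard C++17 only; which policy is actually fastest depends on the implementation):

```cpp
#include <execution>
#include <numeric>
#include <vector>

// Integer addition is associative, so the implementation may regroup
// operands freely: par_unseq is safe and gives a deterministic result.
long long sum_ints(const std::vector<int>& v) {
    return std::reduce(std::execution::par_unseq, v.begin(), v.end(), 0LL);
}

// Floating-point addition is NOT associative: std::reduce may regroup
// operands, so the rounding (and thus the result) can vary between runs.
// std::accumulate keeps strict left-to-right order and is deterministic.
double sum_floats_deterministic(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
}
```

Note that std::reduce (new in C++17) exists precisely because std::accumulate's strict ordering guarantee rules out parallelization.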
On the other hand, once you have a parallel algorithm and its constraints (and possibly after some optimization/transformation), the executor will find a way to execute it. There are default executors (for the CPU, for example), or you can create your own; all the configuration regarding number of threads, workload, processing unit, etc. can then be set there.
As of today, C1 is in the standard, but not C2, so if you use C1 with a compliant compiler, you will not be able to specify which execution profile you want and the library implementation will decide for you (maybe through extensions).
So, to address your questions:
(Regarding your first 5 questions) By definition, the C++17 parallel STL does not define any computation, just data dependencies, in order to allow for possible data-flow transformations. All these questions will (hopefully) be answered by executors; you can see the current proposal here. It will look something like:
auto executor = get_executor();
std::sort(std::execution::par.on(executor), vec.begin(), vec.end());
Some of your questions are already defined in that proposal.
(For the 6th) There are a number of libraries out there that already implement similar concepts (indeed, C++ executors were inspired by some of them), AFAIK: hpx, Thrust or Boost.Compute. I do not know how the last two are actually implemented, but hpx uses lightweight threads and lets you configure the execution profile. Also, the expected (not yet standardized) C++17 syntax of the code above is essentially the same as in (and was heavily inspired by) hpx.
The pre-final C++17 draft indeed says nothing about "how to implement the multi-threaded algorithms". Implementation owners decide on their own how to do that. E.g., Intel's Parallel STL uses TBB as a threading back-end and OpenMP as a vectorization back-end. I guess that to find out how a given implementation matches your machine, you need to read its implementation-specific documentation.
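Because back-end support varies between toolchains, a hedged sketch of portable usage is to probe for the header and the standard feature-test macro, and fall back to the serial overload otherwise:

```cpp
#include <algorithm>
#include <vector>

// __has_include and __cpp_lib_parallel_algorithm are both standard C++17;
// the macro is defined by <execution> on conforming implementations.
#if __has_include(<execution>)
#include <execution>
#endif

// Sort in parallel when the toolchain supports it, serially otherwise.
void portable_sort(std::vector<int>& v) {
#if defined(__cpp_lib_parallel_algorithm)
    std::sort(std::execution::par, v.begin(), v.end());
#else
    std::sort(v.begin(), v.end());
#endif
}
```

On GCC, for example, the parallel overloads compile but may require linking against TBB (-ltbb), so the compile-time check alone does not guarantee a parallel back-end is actually used.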