C++17 upgraded 69 STL algorithms to support parallelism via an optional ExecutionPolicy parameter (as the 1st argument), e.g.
std::sort(std::execution::par, begin(v), end(v));
I suspect the C++17 standard deliberately says nothing about how to implement the multi-threaded algorithms, leaving it up to the library writers to decide what is best (and allowing them to change their minds, later). Still, I'm keen to understand at a high level what issues are being considered in the implementation of the parallel STL algorithms.
Some questions on my mind include (but are not limited to!):
I realise the point of these parallel algorithms is to shield the Programmer from having to worry about these details. However, any info that gives me a high-level mental picture of what's going on inside the library calls would be appreciated.
Parallel STL is an implementation of the C++ standard library algorithms with support for execution policies, as specified in the ISO/IEC 14882:2017 standard, commonly called C++17.
In computer science, a parallel algorithm, as opposed to a traditional serial algorithm, is an algorithm that can perform multiple operations at the same time. It has been a tradition of computer science to describe serial algorithms in abstract machine models, often the one known as the random-access machine.
par means "execute in parallel", which permits the implementation to execute on multiple threads in parallel.
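To make the three policies concrete, here is a minimal sketch (standard C++17 names only; it assumes a standard library that actually ships &lt;execution&gt;, e.g. MSVC, or GCC 9+ with TBB installed):

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// Sort the same data under each of the three C++17 execution policies.
std::vector<int> sort_with_policies(std::vector<int> v) {
    // seq: same semantics as the classic single-threaded overload.
    std::sort(std::execution::seq, v.begin(), v.end());
    // par: the library may split the work across multiple threads.
    std::sort(std::execution::par, v.begin(), v.end());
    // par_unseq: additionally allows interleaving (vectorization) within
    // a thread, so element access functions must not take locks.
    std::sort(std::execution::par_unseq, v.begin(), v.end());
    return v;
}
```

The policy argument only states what the caller permits; an implementation is free to fall back to sequential execution.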
Most of these questions cannot be answered by the standard as of today. However, your question, as I understand it, mixes two concepts:
C1. Constraints on parallel algorithms
C2. Execution of algorithms
The C++17 parallel STL is all about C1: it sets constraints on how instructions and/or threads may be interleaved/transformed in a parallel computation. C2, on the other hand, is in the process of being standardized; the keyword is executor (more on this later).
For C1, there are 3 standard policies (std::execution::seq, par and par_unseq) that correspond to every combination of task and instruction parallelism. For example, when performing an integer accumulation, par_unseq could be used, since the order is not important. However, for floating-point arithmetic, where addition is not associative, a better fit would be seq to, at least, get a deterministic result. In short: policies set constraints on parallel computation, and these constraints can be exploited by a smart compiler.
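The accumulation example above can be sketched as follows (standard C++17 only; which policy is actually fastest depends on the implementation):

```cpp
#include <execution>
#include <numeric>
#include <vector>

// Integer addition is associative, so the implementation may regroup
// operands freely: par_unseq is safe and gives a deterministic result.
long long sum_ints(const std::vector<int>& v) {
    return std::reduce(std::execution::par_unseq, v.begin(), v.end(), 0LL);
}

// Floating-point addition is NOT associative: std::reduce may regroup
// operands, so the rounding (and thus the result) can vary between runs.
// std::accumulate keeps strict left-to-right order and is deterministic.
double sum_floats_deterministic(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
}
```

Note that std::reduce (new in C++17) exists precisely because std::accumulate's strict ordering guarantee rules out parallelization.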
On the other hand, once you have a parallel algorithm and its constraints (and possibly after some optimization/transformation), the executor will find a way to execute it. There are default executors (for the CPU, for example), or you can create your own; all the configuration regarding number of threads, workload, processing unit, etc. can then be set there.
As of today, C1 is in the standard, but not C2, so if you use C1 with a compliant compiler, you will not be able to specify which execution profile you want and the library implementation will decide for you (maybe through extensions).
So, to address your questions:
(Regarding your first 5 questions) By definition, the C++17 parallel STL does not define any computation, just data dependencies, in order to allow for possible data-flow transformations. All these questions will (hopefully) be answered by executors; you can see the current proposal here. It will look something like:
auto executor = get_executor();
std::sort(std::execution::par.on(executor), vec.begin(), vec.end());
Some of your questions are already defined in that proposal.
(For the 6th) There are a number of libraries out there that already implement similar concepts (indeed, C++ executors were inspired by some of them), AFAIK: hpx, Thrust or Boost.Compute. I do not know how the last two are actually implemented, but hpx uses lightweight threads and lets you configure the execution profile. Also, the expected (not yet standardized) C++17 syntax of the code above is essentially the same as in (and was heavily inspired by) hpx.
The pre-final C++17 draft indeed says nothing about "how to implement the multi-threaded algorithms". Implementation owners decide on their own how to do that. E.g., Intel's Parallel STL uses TBB as a threading back-end and OpenMP as a vectorization back-end. I guess that to find out how a given implementation matches your machine, you need to read its implementation-specific documentation.
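Because back-end support varies between toolchains, a hedged sketch of portable usage is to probe for the header and the standard feature-test macro, and fall back to the serial overload otherwise:

```cpp
#include <algorithm>
#include <vector>

// __has_include and __cpp_lib_parallel_algorithm are both standard C++17;
// the macro is defined by <execution> on conforming implementations.
#if __has_include(<execution>)
#include <execution>
#endif

// Sort in parallel when the toolchain supports it, serially otherwise.
void portable_sort(std::vector<int>& v) {
#if defined(__cpp_lib_parallel_algorithm)
    std::sort(std::execution::par, v.begin(), v.end());
#else
    std::sort(v.begin(), v.end());
#endif
}
```

On GCC, for example, the parallel overloads compile but may require linking against TBB (-ltbb), so the compile-time check alone does not guarantee a parallel back-end is actually used.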