
Are C++17 Parallel Algorithms implemented already?

I was trying to play around with the new parallel library features introduced in the C++17 standard, but I couldn't get them to work. I tried compiling with up-to-date versions of g++ (8.1.1) and clang++ (6.0) with -std=c++17, but neither seemed to support #include <execution>, std::execution::par or anything similar.

The cppreference page for parallel algorithms has a long list of algorithms, stating:

Technical specification provides parallelized versions of the following 69 algorithms from algorithm, numeric and memory: ( ... long list ...)

which sounds like the algorithms are ready 'on paper', but not ready to use yet?

In this SO question from over a year ago the answers claim these features hadn't been implemented yet. But by now I would have expected to see some kind of implementation. Is there anything we can use already?

Romeo Valentin asked Jun 25 '18 20:06




2 Answers

GCC 9 has them but you have to install TBB separately

In Ubuntu 19.10, all components have finally aligned:

  • GCC 9 is the default compiler, and it is the minimum GCC version that ships the TBB-backed parallel algorithms
  • TBB (Intel Threading Building Blocks) is at 2019~U8-1, so it meets the minimum 2018 requirement

so you can simply do:

sudo apt install g++ libtbb-dev
g++ -ggdb3 -O3 -std=c++17 -Wall -Wextra -pedantic -o main.out main.cpp -ltbb
./main.out

and use as:

#include <execution>
#include <algorithm>

std::sort(std::execution::par_unseq, input.begin(), input.end());

see also the full runnable benchmark below.

GCC 9 and TBB 2018 are the first ones to work as mentioned in the release notes: https://gcc.gnu.org/gcc-9/changes.html

Parallel algorithms and <execution> (requires Thread Building Blocks 2018 or newer).

Related threads:

  • How to install TBB from source on Linux and make it work
  • trouble linking INTEL tbb library

Ubuntu 18.04 installation

Ubuntu 18.04 is a bit more involved:

  • GCC 9 can be obtained from a trustworthy PPA, so it is not so bad
  • TBB is at version 2017, which does not work, and I could not find a trustworthy PPA for it. Compiling from source is easy, but there is no install target which is annoying...

Here are fully automated tested commands for Ubuntu 18.04:

# Install GCC 9
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install gcc-9 g++-9

# Compile libtbb from source.
sudo apt-get build-dep libtbb-dev
git clone https://github.com/intel/tbb
cd tbb
git checkout 2019_U9
make -j `nproc`
TBB="$(pwd)"
TBB_RELEASE="${TBB}/build/linux_intel64_gcc_cc7.4.0_libc2.27_kernel4.15.0_release"

# Use them to compile our test program.
g++-9 -ggdb3 -O3 -std=c++17 -Wall -Wextra -pedantic -I "${TBB}/include" -L "${TBB_RELEASE}" -Wl,-rpath,"${TBB_RELEASE}" -o main.out main.cpp -ltbb
./main.out

Test program analysis

I have tested with this program that compares the parallel and serial sorting speed.

main.cpp

#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>   // std::uint64_t
#include <cstdlib>   // std::strtoull
#include <execution>
#include <iostream>
#include <random>
#include <string>    // std::stoi
#include <vector>

int main(int argc, char **argv) {
    using clk = std::chrono::high_resolution_clock;
    decltype(clk::now()) start, end;
    std::vector<unsigned long long> input_parallel, input_serial;
    unsigned int seed;
    unsigned long long n;

    // CLI arguments: element count, then an optional PRNG seed.
    std::uniform_int_distribution<std::uint64_t> zero_ull_max(0);
    if (argc > 1) {
        n = std::strtoull(argv[1], NULL, 0);
    } else {
        n = 10;
    }
    if (argc > 2) {
        seed = std::stoi(argv[2]);
    } else {
        seed = std::random_device()();
    }

    // Fill both vectors with the same random data.
    std::mt19937 prng(seed);
    for (unsigned long long i = 0; i < n; ++i) {
        input_parallel.push_back(zero_ull_max(prng));
    }
    input_serial = input_parallel;

    // Sort and time parallel.
    start = clk::now();
    std::sort(std::execution::par_unseq, input_parallel.begin(), input_parallel.end());
    end = clk::now();
    std::cout << "parallel " << std::chrono::duration<float>(end - start).count() << " s" << std::endl;

    // Sort and time serial.
    start = clk::now();
    std::sort(std::execution::seq, input_serial.begin(), input_serial.end());
    end = clk::now();
    std::cout << "serial " << std::chrono::duration<float>(end - start).count() << " s" << std::endl;

    // Both sorts must produce the same result.
    assert(input_parallel == input_serial);
}

On Ubuntu 19.10, on a Lenovo ThinkPad P51 laptop (CPU: Intel Core i7-7820HQ, 4 cores / 8 threads, 2.90 GHz base, 8 MB cache; RAM: 2x Samsung M471A2K43BB1-CRC, 2x 16GiB, 2400 Mbps), a typical output for an input of 100 million numbers to be sorted:

./main.out 100000000 

was:

parallel 2.00886 s
serial 9.37583 s

so the parallel version was about 4.7 times faster! See also: What do the terms "CPU bound" and "I/O bound" mean?
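The speedup is just the serial wall-clock time divided by the parallel one. Here is a minimal sketch of that measurement. The helper names `time_seconds` and `speedup` are hypothetical (they are not part of main.cpp), and I am swapping in `std::chrono::steady_clock` for the `high_resolution_clock` used above because it is guaranteed to be monotonic:

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: run a callable once and return the elapsed
// wall-clock time in seconds.
template <class F>
double time_seconds(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(f)();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}

// Hypothetical helper: how many times faster the parallel run was.
// Both arguments are durations in seconds; parallel_s must be positive.
inline double speedup(double serial_s, double parallel_s) {
    return serial_s / parallel_s;
}
```

With the timings quoted above, `speedup(9.37583, 2.00886)` comes to roughly 4.67. In a benchmark, each sort call would be wrapped in a lambda and passed to `time_seconds`.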

We can confirm that the process is spawning threads with strace:

strace -f -s999 -v ./main.out 100000000 |& grep -E 'clone' 

which shows several lines of type:

[pid 25774] clone(strace: Process 25788 attached
[pid 25774] <... clone resumed> child_stack=0x7fd8c57f4fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fd8c57f59d0, tls=0x7fd8c57f5700, child_tidptr=0x7fd8c57f59d0) = 25788

Also, if I comment out the serial version and run with:

time ./main.out 100000000 

I get:

real    0m5.135s
user    0m17.824s
sys     0m0.902s

which confirms again that the algorithm was parallelized since real < user, and gives an idea of how effectively it can be parallelized on my system (user / real is about 3.5x on 4 cores / 8 hardware threads).
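To spell that arithmetic out: dividing CPU time (user, optionally plus sys) by wall-clock time (real) gives the average number of cores kept busy. A small sketch, with hypothetical function names and the numbers taken from the time output above:

```cpp
// Average number of CPU cores kept busy: CPU time divided by wall-clock time.
// A value close to 1 means the run was effectively serial.
constexpr double busy_cores(double real_s, double user_s, double sys_s = 0.0) {
    return (user_s + sys_s) / real_s;
}

// Fraction of the available hardware threads that were busy on average.
// hardware_threads must be non-zero.
constexpr double utilization(double real_s, double user_s, double sys_s,
                             unsigned hardware_threads) {
    return busy_cores(real_s, user_s, sys_s) / hardware_threads;
}
```

`busy_cores(5.135, 17.824)` is about 3.47, and `utilization(5.135, 17.824, 0.902, 8)` is about 0.46. Keep in mind that this run also includes the serial input generation, so the figure describes the whole process and not the sort alone.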

Error messages

Google, index this please.

If you don't have tbb installed, the error is:

In file included from /usr/include/c++/9/pstl/parallel_backend.h:14,
                 from /usr/include/c++/9/pstl/algorithm_impl.h:25,
                 from /usr/include/c++/9/pstl/glue_execution_defs.h:52,
                 from /usr/include/c++/9/execution:32,
                 from parallel_sort.cpp:4:
/usr/include/c++/9/pstl/parallel_backend_tbb.h:19:10: fatal error: tbb/blocked_range.h: No such file or directory
   19 | #include <tbb/blocked_range.h>
      |          ^~~~~~~~~~~~~~~~~~~~~
compilation terminated.

so we see that <execution> depends on an uninstalled TBB component.

If TBB is too old, e.g. the default Ubuntu 18.04 one, it fails with:

#error Intel(R) Threading Building Blocks 2018 is required; older versions are not supported. 
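If you want to find out which TBB headers the compiler actually sees before hitting that error, something like the following might help. This is only a sketch and not what libstdc++ itself does: it assumes the installed TBB exposes `TBB_VERSION_MAJOR` through `<tbb/version.h>` (oneTBB) or `<tbb/tbb_stddef.h>` (older releases), and the function names are hypothetical:

```cpp
// Pull in TBB's version macros only when the headers are installed.
// Assumption: oneTBB ships <tbb/version.h>, older releases <tbb/tbb_stddef.h>.
#if defined(__has_include)
#  if __has_include(<tbb/version.h>)
#    include <tbb/version.h>
#  elif __has_include(<tbb/tbb_stddef.h>)
#    include <tbb/tbb_stddef.h>
#  endif
#endif

// Major version of the TBB headers visible to the compiler, or 0 if none.
constexpr int tbb_major_version() {
#ifdef TBB_VERSION_MAJOR
    return TBB_VERSION_MAJOR;
#else
    return 0;
#endif
}

// The GCC 9 release notes ask for TBB 2018 or newer.
constexpr bool tbb_new_enough(int major) {
    return major >= 2018;
}
```

A `static_assert(tbb_new_enough(tbb_major_version()), "TBB 2018 or newer is needed");` placed before `#include <execution>` would then fail early with a message of your choosing.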

You can refer to https://en.cppreference.com/w/cpp/compiler_support to check the implementation status of every C++ feature. For your case, just search for "Standardization of Parallelism TS", and you will find that only the MSVC and Intel C++ compilers support this feature now.
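Besides reading the table, you can ask the standard library itself through the feature-test macro `__cpp_lib_parallel_algorithm`, which is specified to be available from `<algorithm>`. A minimal sketch with hypothetical function names; note that the macro only reports what the library headers advertise, so it cannot tell you whether a backend such as TBB is installed:

```cpp
#include <algorithm>  // one of the headers specified to define the macro

// Value of the feature-test macro, or 0 when the library does not define it.
constexpr long parallel_algorithm_macro() {
#ifdef __cpp_lib_parallel_algorithm
    return __cpp_lib_parallel_algorithm;
#else
    return 0L;
#endif
}

// True when the standard library advertises the C++17 parallel algorithms.
constexpr bool has_parallel_algorithms() {
    return parallel_algorithm_macro() >= 201603L;
}
```

Printing `has_parallel_algorithms()` from a tiny program compiled with -std=c++17 gives a quick answer for whichever compiler and standard library you have at hand.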

Nan Xiao answered Oct 20 '22 05:10
