
Why does this simple lambda consistently run faster inside an std::thread than inside the main function with gcc 4.9.2?

The following snippet takes a command line parameter that represents the number of threads to spawn to run a simple for loop concurrently.

If the argument passed is 0, no std::thread is spawned.

On gcc 4.9.2, ./snippet 0 takes 10% longer than ./snippet 1 on average, i.e. the version that spawns one std::thread to execute the loop is faster than the version that just executes the loop inside main.

Does anyone know what's going on? clang-4 does not show this behaviour at all (the version with one std::thread is slower), and gcc 6.2 has the version with one std::thread run only slightly faster (taking the minimum time over ten trials as the measured value).

Here is the snippet: ScopedNanoTimer is just a simple RAII timer. I am compiling with -g -O3 -pthread -std=c++11.

#include <cstdlib>
#include <thread>
#include <vector>

int main(int argc, char** argv) {

   // setup
   if (argc < 2)
      return 1;
   const unsigned n_threads = std::atoi(argv[1]);
   const auto n_iterations = 1000000000ul / (n_threads > 0u ? n_threads : n_threads + 1);

   // define workload
   auto task = [n_iterations]() {
      volatile auto sum = 0ul;
      for (auto i = 0ul; i < n_iterations; ++i) ++sum;
   };

   // time and print
   if (n_threads == 0) {
      task();
   } else {
      std::vector<std::thread> threads;
      for (auto i = 0u; i < n_threads; ++i) threads.emplace_back(task);
      for (auto &thread : threads) thread.join();
   }
   return 0;
}
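
The ScopedNanoTimer itself isn't shown; a minimal sketch of such an RAII timer (an assumed implementation, not the OP's actual code) might look like:

```cpp
#include <chrono>
#include <cstdio>

// Minimal RAII timer: records the start time on construction and
// prints the elapsed nanoseconds on destruction.
class ScopedNanoTimer {
    std::chrono::steady_clock::time_point start_;
public:
    ScopedNanoTimer() : start_(std::chrono::steady_clock::now()) {}
    ~ScopedNanoTimer() {
        const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now() - start_).count();
        std::printf("%lld ns\n", static_cast<long long>(ns));
    }
};
```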

EDIT

Following the suggestions in the comments, I tried to hide from the compiler the fact that the number of iterations was known a priori in the branch where n_threads == 0. I changed the relevant line to

const auto n_iterations = 1000000000ul / (n_threads > 0u ? n_threads : n_threads + 1);

I also removed the outer for loop with 10 executions and all mentions of the ScopedNanoTimer. These changes are now reflected in the snippet above.
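
For reference, another common way to keep the optimizer from constant-folding a loop's trip count is an empty inline-asm barrier, as used by benchmarking libraries. This is a sketch of the trick (a GCC/Clang extension; it is not what the snippet above uses):

```cpp
// Hide a value from the optimizer with empty inline-asm barriers, so the
// loop trip count cannot be constant-folded (GCC/Clang extension).
unsigned long run_loop(unsigned long n_iterations) {
    asm volatile("" : "+r"(n_iterations));  // compiler must treat the value as unknown
    unsigned long sum = 0;
    for (unsigned long i = 0; i < n_iterations; ++i) ++sum;
    asm volatile("" : : "r"(sum));          // sum counts as "used", so the loop isn't elided
    return sum;
}
```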

I compiled with the flags indicated above and executed several times on a workstation with Debian linux, kernel version 3.16.39-1+deb8u2, and processor Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, quad-core. All other programs were closed, cpu throttling/intel speed-step/turbo-boost were turned off and cpu governor policy was set to "performance". Internet connection was turned off.

The trend is always the same: compiled with gcc-4.9.2, the version with no std::thread is about 10% slower than the version that spawns one thread. clang-4, instead, behaves as expected (the version spawning a thread is never faster).

The following measurements convince me that the problem is in gcc-4.9.2's suboptimal optimisations, and is related neither to context switches nor to poor quality of the measurements. That said, not even Godbolt's Compiler Explorer clearly shows me what gcc is doing, so I do not consider the question answered.

Time+context switch measurements with g++-4.9.2

~$ g++ -std=c++11 -pthread -g -O3 snippet.cpp -o snippet_gcc
~$ for i in $(seq 1 10); do /usr/bin/time -v 2>&1 ./snippet_gcc 0 | egrep '((wall clock)|(switch))'; done
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.08
    Voluntary context switches: 1
    Involuntary context switches: 6
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.08
    Voluntary context switches: 1
    Involuntary context switches: 5
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.08
    Voluntary context switches: 1
    Involuntary context switches: 7
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.08
    Voluntary context switches: 1
    Involuntary context switches: 6
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.90
    Voluntary context switches: 1
    Involuntary context switches: 3
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.08
    Voluntary context switches: 1
    Involuntary context switches: 6
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.08
    Voluntary context switches: 1
    Involuntary context switches: 5
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.08
    Voluntary context switches: 1
    Involuntary context switches: 6
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.08
    Voluntary context switches: 1
    Involuntary context switches: 2
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.08
    Voluntary context switches: 1
    Involuntary context switches: 4
~$ for i in $(seq 1 10); do /usr/bin/time -v 2>&1 ./snippet_gcc 1 | egrep '((wall clock)|(switch))'; done
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.79
    Voluntary context switches: 2
    Involuntary context switches: 4
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.95
    Voluntary context switches: 2
    Involuntary context switches: 4
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.81
    Voluntary context switches: 2
    Involuntary context switches: 4
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.87
    Voluntary context switches: 2
    Involuntary context switches: 5
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.87
    Voluntary context switches: 2
    Involuntary context switches: 4
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.97
    Voluntary context switches: 2
    Involuntary context switches: 3
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.87
    Voluntary context switches: 2
    Involuntary context switches: 4
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.85
    Voluntary context switches: 2
    Involuntary context switches: 4
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.87
    Voluntary context switches: 2
    Involuntary context switches: 6
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.95
    Voluntary context switches: 2
    Involuntary context switches: 5

Time+context switch measurements with clang++-4.0

~$ clang++ -std=c++11 -pthread -g -O3 snippet.cpp -o snippet_clang
~$ for i in $(seq 1 10); do /usr/bin/time -v 2>&1 ./snippet_clang 0 | egrep '((wall clock)|(switch))'; done
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
    Voluntary context switches: 1
    Involuntary context switches: 6
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
    Voluntary context switches: 1
    Involuntary context switches: 4
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
    Voluntary context switches: 1
    Involuntary context switches: 5
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
    Voluntary context switches: 1
    Involuntary context switches: 4
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
    Voluntary context switches: 1
    Involuntary context switches: 7
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
    Voluntary context switches: 1
    Involuntary context switches: 3
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
    Voluntary context switches: 1
    Involuntary context switches: 4
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
    Voluntary context switches: 1
    Involuntary context switches: 4
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
    Voluntary context switches: 1
    Involuntary context switches: 3
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
    Voluntary context switches: 1
    Involuntary context switches: 4
~$ for i in $(seq 1 10); do /usr/bin/time -v 2>&1 ./snippet_clang 1 | egrep '((wall clock)|(switch))'; done
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
    Voluntary context switches: 2
    Involuntary context switches: 6
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
    Voluntary context switches: 2
    Involuntary context switches: 6
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
    Voluntary context switches: 2
    Involuntary context switches: 5
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
    Voluntary context switches: 2
    Involuntary context switches: 4
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
    Voluntary context switches: 2
    Involuntary context switches: 4
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
    Voluntary context switches: 2
    Involuntary context switches: 5
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
    Voluntary context switches: 2
    Involuntary context switches: 2
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
    Voluntary context switches: 2
    Involuntary context switches: 3
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
    Voluntary context switches: 2
    Involuntary context switches: 4
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
    Voluntary context switches: 2
    Involuntary context switches: 7
Asked by blue on May 30 '17.


1 Answer

I think you may be the victim of a poor test sample. I attempted to reproduce this behaviour and, after running each option 10 or so times, found that my times had relatively high variance. I ran some more tests using /usr/bin/time -v and found that the program's execution time correlates pretty well with the number of involuntary context switches it experienced.

Option 0: No threads
time,  context switches
20.32, 1806
20.09, 2139
21.01, 1916
21.13, 1873
21.15, 1847
18.67, 1617
19.06, 1692
17.94, 1546
21.40, 1867
18.64, 1629

Option 1: Threads
time,  context switches
19.68, 1750
19.60, 1740
19.35, 1783 
19.60, 1726
19.95, 1823
20.42, 1800
19.54, 1745
19.40, 1699
19.36, 1703

I think you may simply have run your benchmarks during periods of variable load on your operating system. As my time data above shows, the runs over 20 seconds were all gathered during periods of high OS load, while the runs under 19 seconds were gathered during periods of low load.

Logically, I can see why the loop that dispatches a thread should run slower. The overhead of creating a thread is high relative to the body of the loop, which simply increments a number. This should increase the user time needed to run the program. The problem is that this increase is probably negligible compared to the execution time of the entire loop. You are only creating 10 extra threads over the course of your program's life, and performing the calculations within those threads should differ very little, if at all, from performing them in main. The billions of other operations your program performs hide the user-time increase.

If you really wanted to benchmark the creation time of a thread, you would write a program that creates lots of threads and does little else. You should also be careful to run benchmarks like these in environments with as few background processes as possible.
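
Such a thread-creation microbenchmark could be sketched like this (illustrative only; avg_thread_overhead_ns is a hypothetical helper, not code from the question):

```cpp
#include <chrono>
#include <thread>

// Estimate the average std::thread create+join overhead by spawning
// many threads whose bodies do nothing at all.
long long avg_thread_overhead_ns(int n_threads) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n_threads; ++i) {
        std::thread t([] {});  // empty body: only creation/join cost remains
        t.join();
    }
    const auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count()
           / n_threads;
}
```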

This may not be the entirety of the problem, but I believe it is still important to consider.

Answered by Jayson Boubin on Sep 26 '22.