Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creation of std::thread slows down main program by 50%

merely creating a thread and joining it slows down execution of the main thread by 50%. As you can see in the example below the thread does nothing and still has a significant effect on performance. I thought it might be a power/frequency scaling related issue, so I tried to sleep after creating the thread to no avail. The program below if compiled with

g++ -std=c++11 -o out thread_test.cpp -pthread

shows the result of

Before thread() trial 0 time: 312024526 ignore -1593025974
Before thread() trial 1 time: 243018707 ignore -494037597
Before thread() trial 2 time: 242929293 ignore 177714863
Before thread() trial 3 time: 242935290 ignore 129069571
Before thread() trial 4 time: 243113945 ignore 840242475
Before thread() trial 5 time: 242824224 ignore -1635749271
Before thread() trial 6 time: 242809490 ignore -1256215542
Before thread() trial 7 time: 242910180 ignore -555222712
Before thread() trial 8 time: 235645414 ignore 537501443
Before thread() trial 9 time: 235746347 ignore 118363977
After thread() trial 0 time: 567509646 ignore 223146324
After thread() trial 1 time: 476450035 ignore -393907838
After thread() trial 2 time: 476377789 ignore -1678874628
After thread() trial 3 time: 476377012 ignore -1015350122
After thread() trial 4 time: 476185152 ignore 2034280344
After thread() trial 5 time: 476420949 ignore -1647334529
After thread() trial 6 time: 476354679 ignore 441573900
After thread() trial 7 time: 476120322 ignore -1576726357
After thread() trial 8 time: 476464850 ignore -895798632
After thread() trial 9 time: 475996533 ignore -997590921

whereas all trials should be the same speed.

EDIT: use rdtsc() for time measurement, use larger duration, use computed result

thread_test.cpp:

#include <ctime>
#include <thread>
#include <iostream>

int dorands(){
  int a =0;
  for(int i=0; i<10000000; i++){
   a +=rand();
  }
  return a;
}

inline uint64_t rdtsc(){
  uint32_t lo, hi;
  __asm__ __volatile__ (
    "xorl %%eax, %%eax\n"
    "cpuid\n"
    "rdtscp\n"
    : "=a" (lo), "=d" (hi)
    :
    : "%ebx", "%ecx" );
  return (uint64_t)hi << 32 | lo;
}


int foo(){return 0;}

int main(){

  uint64_t begin;
  uint64_t end;

  for(int i = 0; i< 10; i++){
    begin= rdtsc();
    volatile int e = dorands();
    end = rdtsc();
    std::cout << "Before thread() trial "<<i<<" time: " << end-begin << " ignore " << e << std::endl;;
  }

  std::thread t1(foo);
  t1.join();

  for(int i = 0; i< 10; i++){
    begin= rdtsc();
    volatile int e = dorands();
    end = rdtsc();
    std::cout << "After thread() trial "<<i<<" time: " << end-begin << " ignore " << e << std::endl;;
  }

  return 1;
}
like image 668
hlitz Avatar asked Feb 19 '15 17:02

hlitz


2 Answers

std::rand() is C rand() which, under glibc, invokes __random(). __random() invokes __libc_lock_lock() and __libc_lock_unlock(), and I don't think it's a stretch to imagine that if we delve deeper into that code, we'll find that the locks are essentially a no-op until a thread is created.

like image 177
Lightness Races in Orbit Avatar answered Nov 07 '22 02:11

Lightness Races in Orbit


I think you're running into a basic problem: at least on a typical multitasking operating system, there's a range from around (say) a few millseconds up to a second or so within which it's difficult to get meaningful timing measurements.

For extremely short sequences, you can use a clock-counter (e.g., RDTSC on x86), and run it a few times. If a task switch happens during the run, it'll stick out really badly, as a run that takes many times longer than the rest.

That points to the real problem: as soon as you get to a sequence (such as yours) that takes long enough that it's almost certain that at least one task switch will happen while it's running, then you run into a problem: the time lost to task switching can throw the timing off significantly. In particular, if a task switch happens during one run, but not during another, it can make the second appear significantly faster than the first.

Eventually, you get to tasks that take long enough that all of them include several task switches, so the difference due to the number of task switches gets pretty much lost in the noise.

Note: in theory, clock is supposed to measure only CPU time, not wall-clock time. In reality, it's nearly impossible to completely factor out all task switching time.

Your test demonstrates (or may demonstrate, anyway) another fairly basic problem: your dorand() computes something, but doesn't (for example) print out the result. A sufficiently intelligent compiler may (easily) be able to deduce that it has essentially no effect, and basically factor it out completely.

Even if you print out the results from dorand, you haven't seeded the random number generator, so it's required to produce identical results on each run. Again, a sufficiently intelligent compiler could figure that out, and compute the proper result at compile time, and just print out the three correct results. To prevent that we could (as one possibility) seed the random number differently on each run--the usual way is to retrieve the current time, and pass that to srand.

To eliminate (or at least reduce) those problems, we could rewrite the code something like this:

#include <ctime>
#include <thread>
#include <iostream>

long long int dorands(){
  long long int a =0;
  for(int i=0; i<100000000; i++){
    a +=rand();
  }
  return a;
}

int foo(){return 0;}

int main(){
    srand(time(NULL));
  clock_t begin = clock();
  long long int e = dorands();
  clock_t end = clock();
  std::cout << "ignore: " << e << ", trial 1 time: " << end-begin << std::endl;;

  begin = clock();
  e = dorands();
  end = clock();
  std::cout << "ignore: " << e << ", trial 2 time: " << end - begin << std::endl;;

  std::thread t1(foo);
  t1.join();

  begin = clock();
  e = dorands();
  end = clock();
  std::cout << "ignore: " << e << ", trial 3 time: " << end - begin << std::endl;;

  begin = clock();
  e = dorands();
  end = clock();
  std::cout << "ignore: " << e << ", trial 4 time: " << end - begin << std::endl;;


  return 1;
}

Here I've printed out the value returned from dorand, so the compiler can't just skip doing the calls to rand completely. I've also increased the number inside dorand so each trial runs for at least a second (on my computer, they do anyway).

Running it, I get results like this:

ignore: 1638407535924, trial 1 time: 1519
ignore: 1638386748597, trial 2 time: 1455
ignore: 1638433228933, trial 3 time: 1433
ignore: 1638288863328, trial 4 time: 1491

In this particular run, the first trials are slower (on average) than the second trials, but there's enough variation and overlap that we're probably pretty safe guessing that it's just noise--if there's any real difference in average speed, it's much too small for us to measure.

like image 22
Jerry Coffin Avatar answered Nov 07 '22 01:11

Jerry Coffin