
why perf has such high context-switches?

Tags: c++, linux, perf

I was trying to understand Linux perf and found some really confusing behavior:

I wrote a simple multi-threading example with one thread pinned to each core; each thread runs its computation locally and does not communicate with the others (see test.cc below). I was thinking that this example should have really low, if not zero, context switches. However, using perf to profile the example shows thousands of context switches - far more than I expected. I further profiled the command sleep 20 for comparison, which shows far fewer context switches.

This profile result does not make any sense to me. What is causing so many context switches?

> sudo perf stat -e sched:sched_switch ./test
 Performance counter stats for './test':

                 6,725  sched:sched_switch                                          

      20.835 seconds time elapsed

> sudo perf stat -e sched:sched_switch sleep 20

 Performance counter stats for 'sleep 20':

                 1      sched:sched_switch                                          

      20.001 seconds time elapsed

To reproduce the results, run the following commands:

perf stat -e context-switches sleep 20
perf stat -e context-switches ./test

To compile the source code, use the following command:

g++ -std=c++11 -pthread -o test test.cc

// test.cc
#include <pthread.h>  // pthread_setaffinity_np, cpu_set_t, CPU_ZERO, CPU_SET

#include <iostream>
#include <thread>
#include <vector>

int main(int argc, const char** argv) {
  unsigned num_cpus = std::thread::hardware_concurrency();
  std::cout << "Launching " << num_cpus << " threads\n";

  std::vector<std::thread> threads(num_cpus);
  for (unsigned i = 0; i < num_cpus; ++i) {
    threads[i] = std::thread([i] {
      int j = 0;
      while (j++ < 100) {
        int tmp = 0;
        while (tmp++ < 110000000) { }  // busy-wait; no blocking calls
      }
    });

    // Pin thread i to core i.
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(i, &cpuset);
    int rc = pthread_setaffinity_np(threads[i].native_handle(),
                                    sizeof(cpu_set_t), &cpuset);
    if (rc != 0) {
      std::cerr << "Error calling pthread_setaffinity_np: " << rc << "\n";
    }
  }

  for (auto& t : threads) {
    t.join();
  }
  return 0;
}
asked Sep 12 '18 by daoliker
2 Answers

You can use sudo perf sched record -- ./test to determine which processes are being scheduled to run in place of one of your application's threads. When I execute this command on my system, I get:

sudo perf sched record -- ./test
Launching 4 threads
[ perf record: Woken up 10 times to write data ]
[ perf record: Captured and wrote 23.886 MB perf.data (212100 samples) ]

Note that I have four cores and the name of the executable is test. perf sched has captured all the sched:sched_switch events and dumped the data into a file called perf.data by default. The file is about 23 MB and contains about 212,100 events. Profiling lasts from the time perf starts until test terminates.

You can use sudo perf sched map to print all the recorded events in a nice format that looks like this:

             *.         448826.757400 secs .  => swapper:0
              *A0       448826.757461 secs A0 => perf:15875
          *.   A0       448826.757477 secs 
  *.       .   A0       448826.757548 secs 
   .       .  *B0       448826.757601 secs B0 => migration/3:22
   .       .  *.        448826.757608 secs 
  *A0      .   .        448826.757625 secs 
   A0     *C0  .        448826.757775 secs C0 => rcu_sched:7
   A0     *.   .        448826.757777 secs 
  *D0      .   .        448826.757803 secs D0 => ksoftirqd/0:3
  *A0      .   .        448826.757807 secs 
   A0 *E0  .   .        448826.757862 secs E0 => kworker/1:3:13786
   A0 *F0  .   .        448826.757870 secs F0 => kworker/1:0:5886
   A0 *G0  .   .        448826.757874 secs G0 => hud-service:1609
   A0 *.   .   .        448826.758614 secs 
   A0 *H0  .   .        448826.758714 secs H0 => kworker/u8:2:15585
   A0 *.   .   .        448826.758721 secs 
   A0  .  *I0  .        448826.758740 secs I0 => gnome-terminal-:8878
   A0  .   I0 *J0       448826.758744 secs J0 => test:15876
   A0  .   I0 *B0       448826.758749 secs 

The two-letter names A0, B0, C0, E0, and so on, are short names that perf assigns to every thread running on the system. The first four columns show which thread was running on each of the four cores. For example, in the second-to-last row, you can see the first thread created in your for loop; the name assigned to it is J0, and it is running on the fourth core. The asterisk indicates that the thread has just been context-switched in from some other thread; without an asterisk, the same thread has continued to run on the same core for another time slice. A dot represents an idle core. To determine the names of all four of your threads, run the following command:

sudo perf sched map | grep 'test'

On my system, this prints:

   A0  .   I0 *J0       448826.758744 secs J0 => test:15876
   J0  A0 *K0  .        448826.758868 secs K0 => test:15878
   J0 *L0  K0  .        448826.758889 secs L0 => test:15877
   J0  L0  K0 *M0       448826.758894 secs M0 => test:15879

Now that you know the two-letter names assigned to your threads (and all other threads), you can determine which other threads are causing your threads to be context-switched. For example, if you see this:

  *G1  L0  K0  M0       448826.822555 secs G1 => firefox:2384

then you'd know that three of your app's threads are running, but one of the cores is being used to run Firefox, so the fourth thread has to wait until the scheduler decides to run it again.

If you want to find all the time slices in which at least one of your threads was running, you can use the following commands:

sudo perf sched map > mydata
grep -E 'J0|K0|L0|M0' mydata > mydata2
wc -l mydata
wc -l mydata2

The last two commands tell you in how many rows (time slices) at least one thread of your app was running, which you can compare to the total number of time slices. Since there are four cores, the total number of scheduler slots is 4 * (number of time slices). From there you can do all sorts of manual calculations and figure out exactly what happened.

answered Oct 15 '22 by Hadi Brais
We can't tell you exactly what is being scheduled - but you can find out yourself using perf.

perf record -e sched:sched_switch ./test

Note this requires a mounted debugfs and root permissions. Afterwards, perf report will give you an overview of what the scheduler was switching to (or see perf script for a full listing). There is nothing apparent in your code that would cause a context switch (e.g. sleep, waiting for I/O), so it is most likely another task being scheduled on these cores.
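If you just want a quick tally of which tasks the scheduler switched to, the raw perf script output can be post-processed. A hedged sketch, assuming the default sched:sched_switch trace format (which includes a next_comm= field) on a recent kernel:

```shell
# Record the switch events, then count how often each task was
# switched in. Requires root (or a lowered perf_event_paranoid).
sudo perf record -e sched:sched_switch ./test
sudo perf script | grep -o 'next_comm=[^ ]*' | sort | uniq -c | sort -rn | head
```

The most frequent next_comm values are the tasks displacing your threads.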

The reason why sleep has almost no context switches is simple. It goes to sleep almost immediately - which is one context switch. While the task is sleeping, it cannot be displaced by another task.
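You can observe this voluntary/involuntary distinction directly: on Linux, /proc/<pid>/status reports both counters for a process. A minimal sketch (the field names below are the standard procfs ones):

```shell
# A task that blocks (like sleep) accumulates voluntary_ctxt_switches;
# a busy task that gets preempted accumulates nonvoluntary_ctxt_switches.
# /proc/self here resolves to the grep process itself.
grep ctxt_switches /proc/self/status
```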

answered Oct 15 '22 by Zulan