I was trying to understand the linux perf
, and found some really confusing behavior:
I wrote a simple multi-threading example with one thread pinned to each core; each thread runs computation locally and does not communicate with each other (see test.cc
below). I was thinking that this example should have really low, if not zero, context switches. However, using linux perf
to profile the example shows thousands of context-switches - much more than what I expected. I further profiled the linux command sleep 20
for a comparison, showing much fewer context switches.
This profile result does not make any sense to me. What is causing so many context switches?
> sudo perf stat -e sched:sched_switch ./test
Performance counter stats for './test':
6,725 sched:sched_switch
20.835 seconds time elapsed
> sudo perf stat -e sched:sched_switch sleep 20
Performance counter stats for 'sleep 20':
1 sched:sched_switch
20.001 seconds time elapsed
For reproducing the results, please run the following code:
perf stat -e context-switches sleep 20
perf stat -e context-switches ./test
To compile the source code, please type the following code:
g++ -std=c++11 -pthread -o test test.cc
// test.cc
#include <iostream>
#include <thread>
#include <vector>
int main(int argc, const char** argv) {
unsigned num_cpus = std::thread::hardware_concurrency();
std::cout << "Launching " << num_cpus << " threads\n";
std::vector<std::thread> threads(num_cpus);
for (unsigned i = 0; i < num_cpus; ++i) {
threads[i] = std::thread([i] {
int j = 0;
while (j++ < 100) {
int tmp = 0;
while (tmp++ < 110000000) { }
}
});
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(i, &cpuset);
int rc = pthread_setaffinity_np(threads[i].native_handle(),
sizeof(cpu_set_t), &cpuset);
if (rc != 0) {
std::cerr << "Error calling pthread_setaffinity_np: " << rc << "\n";
}
}
for (auto& t : threads) {
t.join();
}
return 0;
}
You can use sudo perf sched record -- ./test
to determine which processes are being scheduled to run in place of the one of the threads of your application. When I execute this command on my system, I get:
sudo perf sched record -- ./test
Launching 4 threads
[ perf record: Woken up 10 times to write data ]
[ perf record: Captured and wrote 23.886 MB perf.data (212100 samples) ]
Note that I have four cores and the name of the executable is test
. perf sched
has captured all the sched:sched_switch
events and dumped the data into a file called perf.data
by default. The size of the file is about 23 MBs and contains abut 212100 events. The duration of profiling will be from the time perf
starts until test
terminates.
You can use sudo perf sched map
to print all the recorded events in a nice format that looks like this:
*. 448826.757400 secs . => swapper:0
*A0 448826.757461 secs A0 => perf:15875
*. A0 448826.757477 secs
*. . A0 448826.757548 secs
. . *B0 448826.757601 secs B0 => migration/3:22
. . *. 448826.757608 secs
*A0 . . 448826.757625 secs
A0 *C0 . 448826.757775 secs C0 => rcu_sched:7
A0 *. . 448826.757777 secs
*D0 . . 448826.757803 secs D0 => ksoftirqd/0:3
*A0 . . 448826.757807 secs
A0 *E0 . . 448826.757862 secs E0 => kworker/1:3:13786
A0 *F0 . . 448826.757870 secs F0 => kworker/1:0:5886
A0 *G0 . . 448826.757874 secs G0 => hud-service:1609
A0 *. . . 448826.758614 secs
A0 *H0 . . 448826.758714 secs H0 => kworker/u8:2:15585
A0 *. . . 448826.758721 secs
A0 . *I0 . 448826.758740 secs I0 => gnome-terminal-:8878
A0 . I0 *J0 448826.758744 secs J0 => test:15876
A0 . I0 *B0 448826.758749 secs
The two-letter names A0
, B0
, C0
, E0
, and so on, are short names given by perf
to every thread running on the system. The first four columns shows which thread was running on each of the four cores. For example, in the second-to-last row, you can see that the first thread that got created in your for
loop. The name assigned to this thread is J0
. The thread is running on the fourth core. The asterisk indicates that it has just been context-switched to from some other thread. Without an asterisk, it means that the same thread has continued to run on the same core for another time slice. A dot represents an idle core. To determine the names for all of the four threads, run the following command:
sudo perf sched map | grep 'test'
On my system, this prints:
A0 . I0 *J0 448826.758744 secs J0 => test:15876
J0 A0 *K0 . 448826.758868 secs K0 => test:15878
J0 *L0 K0 . 448826.758889 secs L0 => test:15877
J0 L0 K0 *M0 448826.758894 secs M0 => test:15879
Now that you know the two-letter names assigned to your threads (and all other threads). you can determine which other threads are causing your threads to be context-switched. For example, if you see this:
*G1 L0 K0 M0 448826.822555 secs G1 => firefox:2384
then you'd know that three of your app threads are running, but the one of the cores is being used to run Firefox. So the fourth thread needs to wait until the scheduler decides when to schedule again.
If you want all the scheduler slots where at least one of your threads is occupying, then you can use the following command:
sudo perf sched map > mydata
grep -E 'J0|K0|L0|M0' mydata > mydata2
wc -l mydata
wc -l mydata2
The last two commands tell you how many rows (time slices) where at least one thread of your app was running. You can compare that to the total number of time slices. Since there are four cores, the total number of scheduler slots is 4 * (number of time slices). Then you can do all sorts of manual calculations and figure out exactly what's happened.
We can't tell you exactly what is being scheduled - but you can find out yourself using perf
.
perf record -e sched:sched_switch ./test
Note this requires a mounted debugfs and root permissions. Now a perf report
will give you an overview of what the scheduler was switching to (or see perf script
for a full listing). Now there is no apparent thing in your code that would cause a context switch (e.g. sleep, waiting for I/O), so it is most likely another task that is being scheduled on these cores.
The reason why sleep
has almost no context switches is simple. It goes to sleep almost immediately - which is one context switch. While the task is not active, it cannot be displaced by another task.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With