I was benchmarking a function and I see that some iteration are slower than other.
After some tests I tried to benchmark two contiguous measurements and I still got some weird results.
The code is on wandbox.
For me the important part is :
using clock = std::chrono::steady_clock;
// ...
for (int i = 0; i < statSize; i++)
{
auto t1 = clock::now();
auto t2 = clock::now();
}
The loop is optimized away as we can see on godbolt.
call std::chrono::_V2::steady_clock::now()
mov r12, rax
call std::chrono::_V2::steady_clock::now()
The code was compiled with:
g++ bench.cpp -Wall -Wextra -std=c++11 -O3
and gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)
on an Intel® Xeon® W-2195 Processor.
I was the only user on the machine and I try to run with and without higth priority (nice
or chrt
) and the result was the same.
The result I got with 100 000 000 iterations was:
The Y-axis is in nanoseconds, it's the result of the line
std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count()
These 4 lines make me think of: No cache/L1/L2/L3 cache misses (even if the “L3 cache misses” line seems to be too close of the L2 line)
I am not sure why there will be cache misses, may be the storage of the result, but it’s not in the measured code.
I have try to run 10 000 times the program with a loop of 1500, because the L1 cache of this processor is:
lscpu | grep L1
L1d cache: 32K
L1i cache: 32K
And 1500*16 bits = 24 000 bits
which is less than 32K so there shouldn’t have cache miss.
And the results:
I still have my 4 lines (and some noise).
So if it’s realy a cache miss I don’t have any idea why it is happening.
I don’t kown if it’s useful for you but I run:
sudo perf stat -e cache-misses,L1-dcache-load-misses,L1-dcache-load ./a.out 1000
With the value 1 000 / 10 000 / 100 000 / 1 000 000
I got between 4.70 and 4.30% of all L1-dcache hits, which seem pretty decent to me.
So the questions are:
Ps : I don’t kwon if I am missing useful informations / flags, feel free to ask !
How reproduce:
The code:
#include <iostream>
#include <chrono>
#include <vector>
int main(int argc, char **argv)
{
int statSize = 1000;
using clock = std::chrono::steady_clock;
if (argc == 2)
{
statSize = std::atoi(argv[1]);
}
std::vector<uint16_t> temps;
temps.reserve(statSize);
for (int i = 0; i < statSize; i++)
{
auto t1 = clock::now();
auto t2 = clock::now();
temps.push_back(
std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count());
}
for (auto t : temps)
std::cout << (int)t << std::endl;
return (0);
}
Build:
g++ bench.cpp -Wall -Wextra -std=c++11 -O3
Generate output (sudo needed):
In this case I run 10 000 time the program. Each time take 100 measures, and I remove the first which is always about 5 time slower :
for i in {1..10000} ; do sudo nice -n -17 ./a.out 100 | tail -n 99 >> fast_1_000_000_uint16_100 ; done
Generate graph:
cat fast_1_000_000_uint16_100 | gnuplot -p -e "plot '<cat'"
The result I have on my machine:
The current_clocksource
is set on tsc
and no switch seen in dmesg
, command used:
dmesg -T | grep tsc
I use this script to remove the HyperThreading (HT) then
grep -c proc /proc/cpuinfo
=> 18
Subtract 1 from the last result to obtain the last available core:
=> 17
Edit /etc/grub/default and add isolcpus=(last result) in in GRUB_CMDLINE_LINUX:
GRUB_CMDLINE_LINUX="isolcpus=17"
Finaly:
sudo update-grub
reboot
// reexecute the script
Now I can use:
taskset -c 17 ./a.out XXXX
So I run 10 000 times a loop of 100 iterations.
for i in {1..10000} ; do sudo /usr/bin/time -v taskset -c 17 ./a.out 100 > ./core17/run_$i 2>&1 ; done
Check if there is any Involuntary context switches
:
grep -L "Involuntary context switches: 0" result/* | wc -l
=> 0
There is none, good. Let's plot :
for i in {1..10000} ; do cat ./core17/run_$i | head -n 99 >> ./no_switch_taskset ; done
cat no_switch_taskset | gnuplot -p -e "plot '<cat'"
Result :
There are still 22 measures greater than 1000 (when most values are arround 20) that I don't understand.
Do the part :
sudo nice -n -17 perf record...
of the Zulan answer's
Benchmarking requires an understanding of what is important to the organization (sometimes called critical success factors) and then measuring performance for these factors. The gap between actual performance and preferred achievement is typically analyzed to identify opportunities for improvement.
Benchmarking can allow you to:Gain an independent perspective about how well you perform compared to other companies. Drill down into performance gaps to identify areas for improvement. Develop a standardized set of processes and metrics. Enable a mindset and culture of continuous improvement.
Benchmarking, as the NCVO suggests, focuses on internal processes rather than on outcomes; it compares and assesses performance in order to achieve continuous improvement. Benchmarking enables organisations to compare themselves with competitors, discover what lessons to learn from others and what to focus on.
Three different types of benchmarking can be defined in this way: process, performance and strategic.
I can't reproduce it with these particular clustered lines, but here is some general information.
As discussed in the comments, nice on a normal idle system is just a best effort. You still have at least
The scheduling tick timer
Kernel tasks that are bound to a certain code
Your task may be migrated from one core to another for an arbitrary reason
You can use isolcpus
and taskset
to get exclusive cores for certain processes to avoid some of that, but I don't think you can really get rid of all the kernel tasks. In addition, use nohz=full
to disable the scheduling tick. You should also disable hyperthreading to get exclusive access to a core from a hardware thread.
Except for taskset
, which I absolutely recommend for any performance measurement, these are quite unusual measures.
If there is a suspicion what could be happening, you can usually setup a measurement to confirm or disprove that hypothesis. perf
and tracepoints are great for that. For example, we can start with looking at scheduling activity and some interrupts:
sudo nice -n -17 perf record -o perf.data -e sched:sched_switch -e irq:irq_handler_entry -e irq:softirq_entry ./a.out ...
perf script
will now tell list you every occurrence. To correlate that with slow iterations you can use perf probe
and a slightly modified benchmark:
void __attribute__((optimize("O0"))) record_slow(int64_t count)
{
(void)count;
}
...
auto count = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
if (count > 100) {
record_slow(count);
}
temps.push_back(count);
And compile with -g
sudo perf probe -x ./a.out record_slow count
Then add -e probe_a:record_slow
to the call to perf record
. Now if you are lucky, you find some close events, e.g.:
a.out 14888 [005] 51213.829062: irq:softirq_entry: vec=1 [action=TIMER]
a.out 14888 [005] 51213.829068: probe_a:record_slow: (559354aec479) count=9029
Be aware: while this information will likely explain some of your observation, you enter a world of even more puzzling questions and oddities. Also, while perf
is pretty low-overhead, there may be some perturbation on what you measure.
First of all, you need to be clear what you actually measure: The time to execute std::chrono::steady_clock::now()
. It's actually good to do that to figure out at least this measurement overhead as well as the precision of the clock.
That's actually a tricky point. The cost of this function, with clock_gettime
underneath, depends on your current clocksource1. If that's tsc
you're fine - hpet
is much slower. Linux may switch quietly2 from tsc
to hpet
during operation.
Sometimes you might need to do benchmarks with extreme isolation, but usually that's not necessary even for very low-level micro-architecture benchmarks. Instead, you can use statistical effects: repeat the measurement. Use the appropriate methods (mean, quantiles), sometimes you may want to exclude outliers.
If the measurement kernel is not significantly longer than timer precision, you will have to repeat the kernel and measure outside to get a throughput rather than a latency, which may or may not be different.
Yes - benchmarking right is very complicated, you need to consider a lot of aspects, especially when you get closer to the hardware and the your kernel times get very short. Fortunately there's some help, for example Google's benchmark library provides a lot of help in terms of doing the right amount of repetitions and also in terms having experiment factors.
1/sys/devices/system/clocksource/clocksource0/current_clocksource
2 Actually it's in dmesg
as something like
clocksource: timekeeping watchdog on CPU: Marking clocksource 'tsc' as unstable because the skew is too large:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With