
Inconsistency when benchmarking two contiguous measurements

I was benchmarking a function and I noticed that some iterations are slower than others.

After some tests I tried to benchmark two contiguous measurements and I still got some weird results.

The code is on wandbox.

For me the important part is:

using clock = std::chrono::steady_clock;
// ...
for (int i = 0; i < statSize; i++)
{
    auto t1 = clock::now();
    auto t2 = clock::now();
}

The loop is optimized away, as we can see on godbolt:

call std::chrono::_V2::steady_clock::now()
mov r12, rax
call std::chrono::_V2::steady_clock::now()

The code was compiled with:

g++  bench.cpp  -Wall  -Wextra -std=c++11 -O3

and gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) on an Intel® Xeon® W-2195 Processor.

I was the only user on the machine, and I tried running with and without high priority (nice or chrt); the result was the same.

The result I got with 100 000 000 iterations was:

[plot: 100 000 000 iterations]

The Y-axis is in nanoseconds; it is the result of the line

std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count()

These 4 lines make me think of cache effects: no cache miss / L1 / L2 / L3 cache misses (even if the “L3 cache miss” line seems to be too close to the L2 line).

I am not sure why there would be cache misses; maybe it is the storage of the result, but that is not in the measured code.

I have tried running the program 10 000 times with a loop of 1500, because the L1 cache of this processor is:

lscpu | grep L1 
L1d cache:             32K
L1i cache:             32K

And 1500 × 16 bits = 3 000 bytes, which is far less than 32K, so there shouldn't be any cache misses.

And the results:

[plot: 10 000 runs of the program with a loop of 1500]

I still have my 4 lines (and some noise).

So if it really is a cache miss, I have no idea why it is happening.

I don't know if it's useful for you, but I ran:

sudo perf stat -e cache-misses,L1-dcache-load-misses,L1-dcache-load  ./a.out 1000

With the values 1 000 / 10 000 / 100 000 / 1 000 000.

I got between 4.30% and 4.70% of all L1-dcache hits, which seems pretty decent to me.

So the questions are:

  • What is the cause of these slowdowns?
  • How do I produce a good-quality benchmark of a function when I can't even get a constant time for a no-op?

PS: I don't know if I am missing useful information / flags; feel free to ask!


How to reproduce:

  1. The code:

    #include <iostream>
    #include <chrono>
    #include <cstdint>   // uint16_t
    #include <cstdlib>   // std::atoi
    #include <vector>
    
    int main(int argc, char **argv)
    {
        int statSize = 1000;
        using clock = std::chrono::steady_clock;
        if (argc == 2)
        {
            statSize = std::atoi(argv[1]);
        }
    
        std::vector<uint16_t> temps;
        temps.reserve(statSize);
        for (int i = 0; i < statSize; i++)
        {
    
            auto t1 = clock::now();
    
            auto t2 = clock::now();
    
            temps.push_back(
                std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count());
        }
    
        for (auto t : temps)
            std::cout << (int)t << std::endl;
    
        return (0);
    }
    
  2. Build:

    g++  bench.cpp  -Wall  -Wextra -std=c++11 -O3
    
  3. Generate output (sudo needed):

    In this case I run the program 10 000 times. Each run takes 100 measures, and I remove the first, which is always about 5 times slower:

     for i in {1..10000} ; do sudo nice -n -17 ./a.out 100 | tail -n 99  >> fast_1_000_000_uint16_100 ; done
    
  4. Generate graph:

    cat fast_1_000_000_uint16_100 | gnuplot -p -e "plot '<cat'"
    
  5. The result I have on my machine:

[plot: the result on my machine]


Where I am after Zulan's answer and all the comments

The current_clocksource is set to tsc and no switch is seen in dmesg; command used:

dmesg -T | grep tsc

I use this script to disable HyperThreading (HT), then:

grep -c proc /proc/cpuinfo
=> 18

Subtract 1 from the last result to obtain the last available core:

=> 17

Edit /etc/default/grub and add isolcpus=(last result) to GRUB_CMDLINE_LINUX:

GRUB_CMDLINE_LINUX="isolcpus=17"

Finally:

sudo update-grub
reboot 
# re-execute the script

Now I can use:

taskset -c 17 ./a.out XXXX

So I run the program 10 000 times, each with a loop of 100 iterations.

for i in {1..10000} ; do sudo /usr/bin/time -v taskset -c 17 ./a.out 100  > ./core17/run_$i 2>&1 ; done

Check whether there are any involuntary context switches:

grep -L "Involuntary context switches: 0" ./core17/* | wc -l
=> 0 

There are none, good. Let's plot:

for i in {1..10000} ; do cat ./core17/run_$i | head -n 99 >> ./no_switch_taskset ; done
cat no_switch_taskset | gnuplot -p -e "plot '<cat'"

Result:

[plot: 22 outliers (1000 ns and more) out of 1 000 000 measures]

There are still 22 measures greater than 1000 (while most values are around 20) that I don't understand.

Next step, TBD

Do the

sudo nice -n -17 perf record...

part of Zulan's answer.

asked Jun 18 '19 by Martin Morterol


1 Answer

I can't reproduce it with these particular clustered lines, but here is some general information.

Possible causes

As discussed in the comments, nice on a normal idle system is just a best effort. You still have at least

  1. The scheduling tick timer

  2. Kernel tasks that are bound to a certain core

  3. Your task may be migrated from one core to another for an arbitrary reason

You can use isolcpus and taskset to get exclusive cores for certain processes to avoid some of that, but I don't think you can really get rid of all the kernel tasks. In addition, use nohz=full to disable the scheduling tick. You should also disable hyperthreading to get exclusive access to a core from a hardware thread.
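If you would rather pin from inside the program than wrap it in taskset, here is a minimal sketch using the Linux sched_setaffinity API (my illustration, not part of the original answer; the core index is whatever core you isolated):

// Minimal sketch (Linux-specific, glibc): pin the calling process to one core
// from inside the benchmark itself; `taskset -c <core>` does the same externally.
#include <sched.h>
#include <cstdio>

bool pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // pid 0 means "the calling process"
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return false;
    }
    return true;
}

Calling pin_to_core(17) at the start of main would correspond to the taskset -c 17 invocation used above.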

Except for taskset, which I absolutely recommend for any performance measurement, these are quite unusual measures.

Measure instead of guessing

If there is a suspicion of what could be happening, you can usually set up a measurement to confirm or disprove that hypothesis. perf and tracepoints are great for that. For example, we can start by looking at scheduling activity and some interrupts:

sudo nice -n -17 perf record -o perf.data -e sched:sched_switch -e irq:irq_handler_entry -e irq:softirq_entry ./a.out ...

perf script will now list every occurrence. To correlate that with slow iterations you can use perf probe and a slightly modified benchmark:

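// Empty sink, compiled at -O0 so the call and its `count` argument survive -O3
// and `perf probe` can attach to it.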
void __attribute__((optimize("O0"))) record_slow(int64_t count)
{
    (void)count;
}

...

    auto count = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    if (count > 100) {
        record_slow(count);
    }
    temps.push_back(count);

And compile with -g

sudo perf probe -x ./a.out record_slow count

Then add -e probe_a:record_slow to the call to perf record. Now if you are lucky, you find some close events, e.g.:

 a.out 14888 [005] 51213.829062:    irq:softirq_entry: vec=1 [action=TIMER]
 a.out 14888 [005] 51213.829068:  probe_a:record_slow: (559354aec479) count=9029

Be aware: while this information will likely explain some of your observations, you enter a world of even more puzzling questions and oddities. Also, while perf is pretty low-overhead, there may be some perturbation of what you measure.

What are we benchmarking?

First of all, you need to be clear about what you actually measure: the time to execute std::chrono::steady_clock::now(). It's actually good to do that, to figure out at least this measurement overhead as well as the precision of the clock.
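For example, here is a small sketch (mine, not from the original answer) that estimates the effective precision of the clock by spinning until now() returns a new value and keeping the smallest observed step:

#include <algorithm>
#include <chrono>
#include <iostream>

int main()
{
    using clock = std::chrono::steady_clock;
    auto best = clock::duration::max();
    // Spin until the clock advances; the smallest step over many tries
    // approximates the effective resolution of the clock.
    for (int i = 0; i < 1000; ++i)
    {
        auto t1 = clock::now();
        auto t2 = t1;
        while (t2 == t1)
            t2 = clock::now();
        best = std::min(best, t2 - t1);
    }
    std::cout << "smallest observable step: "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(best).count()
              << " ns" << std::endl;
    return 0;
}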

That's actually a tricky point. The cost of this function, with clock_gettime underneath, depends on your current clocksource¹. If that's tsc you're fine; hpet is much slower. Linux may switch quietly² from tsc to hpet during operation.
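Since the clocksource can change behind your back, it can be worth logging it next to the benchmark results. A minimal sketch (my addition) that reads the sysfs file from footnote 1:

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    // Read the clocksource the kernel is currently using (see footnote 1).
    std::ifstream f("/sys/devices/system/clocksource/clocksource0/current_clocksource");
    std::string source;
    std::getline(f, source);
    std::cout << "clocksource: " << source << std::endl;
    return 0;
}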

What to do to get stable results?

Sometimes you might need to do benchmarks with extreme isolation, but usually that's not necessary, even for very low-level micro-architecture benchmarks. Instead, you can use statistical effects: repeat the measurement. Use the appropriate methods (mean, quantiles); sometimes you may want to exclude outliers.

If the measurement kernel is not significantly longer than timer precision, you will have to repeat the kernel and measure outside to get a throughput rather than a latency, which may or may not be different.
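A minimal sketch of that idea (the names kernel, runs and reps are mine): time reps back-to-back executions of the kernel per sample so that clock overhead is amortized, then report the median and the 95th percentile over many samples:

#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>

// The code under test; in the question it is effectively empty.
// The empty asm with a memory clobber keeps -O3 from deleting the loop.
static void kernel() { asm volatile("" ::: "memory"); }

int main()
{
    using clock = std::chrono::steady_clock;
    const int runs = 10000;  // number of samples
    const int reps = 1000;   // kernel repetitions per sample

    std::vector<double> ns_per_call;
    ns_per_call.reserve(runs);

    for (int r = 0; r < runs; ++r)
    {
        auto t1 = clock::now();
        for (int i = 0; i < reps; ++i)
            kernel();
        auto t2 = clock::now();
        ns_per_call.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count()
            / static_cast<double>(reps));
    }

    std::sort(ns_per_call.begin(), ns_per_call.end());
    std::cout << "median: " << ns_per_call[runs / 2] << " ns/call\n"
              << "p95:    " << ns_per_call[runs * 95 / 100] << " ns/call\n";
    return 0;
}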

Yes, benchmarking right is very complicated; you need to consider a lot of aspects, especially when you get closer to the hardware and your kernel times get very short. Fortunately there's some help; for example, Google's benchmark library provides a lot of help in terms of doing the right number of repetitions and also in terms of having experiment factors.
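For illustration, roughly what the measurement from the question looks like with Google Benchmark (assuming the library is installed; link with -lbenchmark -lpthread):

#include <benchmark/benchmark.h>
#include <chrono>

// The library chooses the number of iterations and reports statistics.
static void BM_SteadyClockNow(benchmark::State& state)
{
    for (auto _ : state)
        benchmark::DoNotOptimize(std::chrono::steady_clock::now());
}
BENCHMARK(BM_SteadyClockNow);

BENCHMARK_MAIN();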

¹ /sys/devices/system/clocksource/clocksource0/current_clocksource

² Actually it's in dmesg as something like

clocksource: timekeeping watchdog on CPU: Marking clocksource 'tsc' as unstable because the skew is too large:

answered Sep 30 '22 by Zulan