Inconsistency when benchmarking two contiguous measurements

Tags:

I was benchmarking a function and I see that some iteration are slower than other.

After some tests I tried to benchmark two contiguous measurements and I still got some weird results.

The code is on wandbox.

For me the important part is :

using clock = std::chrono::steady_clock;
// ...
for (int i = 0; i < statSize; i++)
{
    auto t1 = clock::now();
    auto t2 = clock::now();
}

The loop is optimized away as we can see on godbolt.

call std::chrono::_V2::steady_clock::now()
mov r12, rax
call std::chrono::_V2::steady_clock::now()

The code was compiled with:

g++  bench.cpp  -Wall  -Wextra -std=c++11 -O3

and gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) on an Intel® Xeon® W-2195 Processor.

I was the only user on the machine and I try to run with and without higth priority (nice or chrt) and the result was the same.

The result I got with 100 000 000 iterations was:

100 000 000 iterations

The Y-axis is in nanoseconds, it's the result of the line

std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count()

These 4 lines make me think of: No cache/L1/L2/L3 cache misses (even if the “L3 cache misses” line seems to be too close of the L2 line)

I am not sure why there will be cache misses, may be the storage of the result, but it’s not in the measured code.

I have try to run 10 000 times the program with a loop of 1500, because the L1 cache of this processor is:

lscpu | grep L1 
L1d cache:             32K
L1i cache:             32K

And 1500*16 bits = 24 000 bits which is less than 32K so there shouldn’t have cache miss.

And the results:

10 000 time the program with a loop of 1500

I still have my 4 lines (and some noise).

So if it’s realy a cache miss I don’t have any idea why it is happening.

I don’t kown if it’s useful for you but I run:

sudo perf stat -e cache-misses,L1-dcache-load-misses,L1-dcache-load  ./a.out 1000

With the value 1 000 / 10 000 / 100 000 / 1 000 000

I got between 4.70 and 4.30% of all L1-dcache hits, which seem pretty decent to me.

So the questions are:

What is the cause of these slowdown?
How produce a qualitative benchmark of a function when I can’t have a constant time for a No operation ?

Ps : I don’t kwon if I am missing useful informations / flags, feel free to ask !

How reproduce:

The code:

#include <iostream>
#include <chrono>
#include <vector>

int main(int argc, char **argv)
{
    int statSize = 1000;
    using clock = std::chrono::steady_clock;
    if (argc == 2)
    {
        statSize = std::atoi(argv[1]);
    }

    std::vector<uint16_t> temps;
    temps.reserve(statSize);
    for (int i = 0; i < statSize; i++)
    {

        auto t1 = clock::now();

        auto t2 = clock::now();

        temps.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count());
    }

    for (auto t : temps)
        std::cout << (int)t << std::endl;

    return (0);
}

Build:

g++  bench.cpp  -Wall  -Wextra -std=c++11 -O3

Generate output (sudo needed):

In this case I run 10 000 time the program. Each time take 100 measures, and I remove the first which is always about 5 time slower :
```
 for i in {1..10000} ; do sudo nice -n -17 ./a.out 100 | tail -n 99  >> fast_1_000_000_uint16_100 ; done
```

Generate graph:

cat fast_1_000_000_uint16_100 | gnuplot -p -e "plot '<cat'"

The result I have on my machine:

enter image description here

Where I am after the answer of Zulan and all the comments

The current_clocksource is set on tsc and no switch seen in dmesg, command used:

dmesg -T | grep tsc

I use this script to remove the HyperThreading (HT) then

grep -c proc /proc/cpuinfo
=> 18

Subtract 1 from the last result to obtain the last available core:

=> 17

Edit /etc/grub/default and add isolcpus=(last result) in in GRUB_CMDLINE_LINUX:

GRUB_CMDLINE_LINUX="isolcpus=17"

Finaly:

sudo update-grub
reboot 
// reexecute the script

Now I can use:

taskset -c 17 ./a.out XXXX

So I run 10 000 times a loop of 100 iterations.

for i in {1..10000} ; do sudo /usr/bin/time -v taskset -c 17 ./a.out 100  > ./core17/run_$i 2>&1 ; done

Check if there is any Involuntary context switches:

grep -L "Involuntary context switches: 0" result/* | wc -l
=> 0

There is none, good. Let's plot :

for i in {1..10000} ; do cat ./core17/run_$i | head -n 99 >> ./no_switch_taskset ; done
cat no_switch_taskset | gnuplot -p -e "plot '<cat'"

Result :

22 fail (sore 1000 and more) in 1 000 000

There are still 22 measures greater than 1000 (when most values are arround 20) that I don't understand.

Next step, TBD

Do the part :

sudo nice -n -17 perf record...

of the Zulan answer's

967

asked Jun 18 '19 07:06

Martin Morterol

1 Answers

I can't reproduce it with these particular clustered lines, but here is some general information.

Possible causes

As discussed in the comments, nice on a normal idle system is just a best effort. You still have at least

The scheduling tick timer
Kernel tasks that are bound to a certain code
Your task may be migrated from one core to another for an arbitrary reason

You can use isolcpus and taskset to get exclusive cores for certain processes to avoid some of that, but I don't think you can really get rid of all the kernel tasks. In addition, use nohz=full to disable the scheduling tick. You should also disable hyperthreading to get exclusive access to a core from a hardware thread.

Except for taskset, which I absolutely recommend for any performance measurement, these are quite unusual measures.

Measure instead of guessing

If there is a suspicion what could be happening, you can usually setup a measurement to confirm or disprove that hypothesis. perf and tracepoints are great for that. For example, we can start with looking at scheduling activity and some interrupts:

sudo nice -n -17 perf record -o perf.data -e sched:sched_switch -e irq:irq_handler_entry -e irq:softirq_entry ./a.out ...

perf script will now tell list you every occurrence. To correlate that with slow iterations you can use perf probe and a slightly modified benchmark:

void __attribute__((optimize("O0"))) record_slow(int64_t count)
{
    (void)count;
}

...

    auto count = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    if (count > 100) {
        record_slow(count);
    }
    temps.push_back(count);

And compile with -g

sudo perf probe -x ./a.out record_slow count

Then add -e probe_a:record_slow to the call to perf record. Now if you are lucky, you find some close events, e.g.:

 a.out 14888 [005] 51213.829062:    irq:softirq_entry: vec=1 [action=TIMER]
 a.out 14888 [005] 51213.829068:  probe_a:record_slow: (559354aec479) count=9029

Be aware: while this information will likely explain some of your observation, you enter a world of even more puzzling questions and oddities. Also, while perf is pretty low-overhead, there may be some perturbation on what you measure.

What are we benchmarking?

First of all, you need to be clear what you actually measure: The time to execute std::chrono::steady_clock::now(). It's actually good to do that to figure out at least this measurement overhead as well as the precision of the clock.

That's actually a tricky point. The cost of this function, with clock_gettime underneath, depends on your current clocksource¹. If that's tsc you're fine - hpet is much slower. Linux may switch quietly² from tsc to hpet during operation.

What to do to get stable results?

Sometimes you might need to do benchmarks with extreme isolation, but usually that's not necessary even for very low-level micro-architecture benchmarks. Instead, you can use statistical effects: repeat the measurement. Use the appropriate methods (mean, quantiles), sometimes you may want to exclude outliers.

If the measurement kernel is not significantly longer than timer precision, you will have to repeat the kernel and measure outside to get a throughput rather than a latency, which may or may not be different.

Yes - benchmarking right is very complicated, you need to consider a lot of aspects, especially when you get closer to the hardware and the your kernel times get very short. Fortunately there's some help, for example Google's benchmark library provides a lot of help in terms of doing the right amount of repetitions and also in terms having experiment factors.

¹/sys/devices/system/clocksource/clocksource0/current_clocksource

² Actually it's in dmesg as something like

clocksource: timekeeping watchdog on CPU: Marking clocksource 'tsc' as unstable because the skew is too large:

answered Sep 30 '22 16:09

Zulan

Related questions
                            
                                What value should `std::stringstream::fail()` return after reading then writing? (gcc vs clang)
                            
                                In C++, does the size of an enumeration have to be equal to the size of its underlying type?
                            
                                Is the behaviour of std::get on rvalue-reference tuples dangerous?
                            
                                "Non-pointer POD" type in C++
                            
                                Rationale for the shared_ptr atomic functions taking pointer instead of reference
                            
                                Does a function pointer need to point to a function with external linkage when used as a template parameter?
                            
                                OpenMP offloading to Nvidia wrong reduction
                            
                                does c++ have an equivalent boost::numeric_cast<DestType>(SourceType)?
                            
                                Deleting boost interprocess_mutex when other process might be using it
                            
                                Is std::atomic<T> safe with interrupts when std::atomic<T>::is_always_lock_free is false?
                            
                                Why can't I bind a const lvalue reference to a function returning T&&?
                            
                                Use Universal Windows C# class library from Native C++
                            
                                How to choose the overloaded function called
                            
                                -Wshadow=global considers enum class entry shadowing a global. Why?
                            
                                constexpr parametrized function pointer
                            
                                Is there a way to use custom mangling in g++/clang++?
                            
                                Why does vulkan report a single device in a system with an integrated graphics card and a GPU?
                            
                                Why can static local object be accessed via pointer or reference from outside?
                            
                                C++ nested template issue
                            
                                Do pointer casts maintain alignment properties?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Inconsistency when benchmarking two contiguous measurements

Tags:

c++

performance

linux

caching

performance-testing