I've written the following program to play around with std::chrono for a bit:
#include <iostream>
#include <chrono>
int main(int argc, char** argv) {
    const long iterationCount = 10000;
    long count = 0;
    for (long i = 0; i < iterationCount; i++) {
        auto start = std::chrono::high_resolution_clock::now();
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        auto end = std::chrono::high_resolution_clock::now();
        auto timeTaken = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
        std::cout << timeTaken << " " << std::endl;
    }
}
I've compiled this using G++ without compiler optimisations enabled:
g++ chrono.cpp -o chrono
I subsequently ran this program a few times, and got an interesting pattern. For the first 500-1000 iterations the program runs about 7-8 times slower than it does for the remaining iterations.
Here's a sample output of this program: https://pastebin.com/tUQsQEAQ
What causes this discrepancy in runtimes? My first thought was the cache, but wouldn't that become saturated very quickly?
In case it matters, my operating system is Ubuntu 18.04, and my g++ version is 7.3.0.
After a micro-architectural, implementation-defined time, and provided the CPU can find some thermal/power headroom, frequency scaling kicks in to speed up the clock of the demanding core, up to a maximum (within the limits of the TDP).
Intel's implementation is called Turbo Boost.
If you disable frequency scaling on your system (e.g. with sudo cpupower frequency-set --governor performance; cpupower is in the cpupowerutils package), the time of each iteration is more or less the same.
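One way to watch the ramp-up directly is to print the kernel's idea of the current core frequency next to each timing sample. The following is only a sketch, assuming a Linux cpufreq driver that exposes scaling_cur_freq under sysfs; the inner loop of 250 increments is a compact stand-in for the question's unrolled count++ block:

#include <chrono>
#include <fstream>
#include <iostream>
int main() {
    volatile long count = 0; // volatile so the increments are not optimised away
    for (long i = 0; i < 10000; i++) {
        auto start = std::chrono::high_resolution_clock::now();
        for (int j = 0; j < 250; j++) count++; // stand-in for the 250 unrolled increments
        auto end = std::chrono::high_resolution_clock::now();
        // scaling_cur_freq reports the current frequency of CPU 0 in kHz (Linux-specific path)
        std::ifstream freqFile("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
        long freqKHz = 0;
        freqFile >> freqKHz;
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
        std::cout << ns << " ns @ " << freqKHz << " kHz" << std::endl;
    }
}

If Turbo Boost is the cause, the reported frequency should climb during the slow initial iterations and plateau once the timings stabilise (pinning the process to CPU 0, e.g. with taskset -c 0, makes the cpu0 reading meaningful).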
The loop itself is very easy to predict: you can expect only a few mispredictions, if not just one at the end from the loop control. The code inside the C++ library is likely harder to predict, but even so it won't take that long (1000 iterations) for the BPU to catch up with your code.
So you can rule out branch misprediction.
More or less the same applies to the I-cache (there is little use of the D-cache, unless the C++ library implementation is heavy on variable usage).
The code should be small enough to fit in the L1-I, and most of it even in the DSB.
L1-I misses don't take 1000 iterations to resolve, and if there were a heavy set conflict in the cache, it would show up as a general slowdown that wouldn't go away after 1000 iterations.
Generally speaking, it is a known effect that code runs faster from the second iteration of a loop-carried dependency chain, because on the first pass the CPU fills the caches (data, instructions, TLB).
It can eventually slow down again if the CPU runs out of resources: for example, if there is a lot of port pressure (e.g. many identical long-latency instructions competing for the same few ports), the RS may fill up and stall the FE; lots of loads/stores can fill the MOB/SB/LB, and lots of jumps can fill the BOB.
However, these effects kick in very quickly, to the point that they dominate the run time of the code.
In this case, the change in speed happens very, very late (in CPU terms), which makes it natural to think of an on-demand process like Turbo Boost.
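If the goal is a stable micro-benchmark rather than observing the ramp, a common workaround is to run the measured code for a while before the timed loop and discard those samples, so the caches, the BPU and the frequency governor all reach steady state first. A minimal sketch (the warm-up count of 2000 is an arbitrary choice, not something from the question):

#include <chrono>
#include <iostream>
int main() {
    volatile long count = 0;
    const long warmup = 2000;    // discarded samples: lets caches, BPU and governor settle
    const long measured = 10000; // samples actually reported
    for (long i = 0; i < warmup + measured; i++) {
        auto start = std::chrono::high_resolution_clock::now();
        for (int j = 0; j < 250; j++) count++; // same work as the original loop body
        auto end = std::chrono::high_resolution_clock::now();
        if (i >= warmup) // only report samples taken after the warm-up phase
            std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << std::endl;
    }
}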
As others have said, you are almost certainly seeing the effect of dynamic CPU frequency scaling.
I can reproduce your results on my machine. But if I turn off dynamic CPU frequency scaling with the cpufreq-set utility (to make the CPU run at a constant frequency), the effect you see goes away.
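To quantify the effect before and after pinning the frequency, you can compare the mean of the early samples against the mean of the rest. A rough sketch (the split at 500 iterations follows the boundary the question observed, and is not a magic number):

#include <chrono>
#include <iostream>
#include <vector>
int main() {
    volatile long count = 0;
    std::vector<long long> samples;
    for (long i = 0; i < 10000; i++) {
        auto start = std::chrono::high_resolution_clock::now();
        for (int j = 0; j < 250; j++) count++; // stand-in for the unrolled increments
        auto end = std::chrono::high_resolution_clock::now();
        samples.push_back(std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count());
    }
    const size_t split = 500; // boundary reported in the question
    long long early = 0, late = 0;
    for (size_t i = 0; i < samples.size(); i++)
        (i < split ? early : late) += samples[i];
    std::cout << "early mean: " << early / double(split) << " ns, late mean: "
              << late / double(samples.size() - split) << " ns" << std::endl;
}

With the performance governor the two means should come out roughly equal; with on-demand scaling the early mean should be several times larger.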