
(Un-)Deterministic CPU behavior and reasoning about (physical) execution duration

In the past I have dealt with time-critical software development. The development of these applications basically proceeded as follows: "let's write the code, test latency and jitter, and optimize both until they are in the acceptable range." I find that highly frustrating; it isn't what I call proper engineering and I want to do better.

So I looked into the question: why do we have jitter at all? And the answers, of course, are:

  • caching: getting a piece of code or data from main memory takes some 2 orders of magnitude more time than getting the same data from L1 cache. So the physical execution time depends on what is in the cache. Which, in turn, depends on several factors:
    • code and data layout of the application: we all know the horrifying row- vs. column-major matrix traversal example
    • caching strategy of the CPU, including speculative pre-fetching of cache lines
    • other processes on the same core competing for the same cache
  • branch prediction: the CPU tries to guess which branch of a conditional jump will be executed. Even if the same conditional jump is executed twice, the prediction may differ and hence a "pipeline bubble" might form one time, but not the other.
  • Interrupts: asynchronous behavior obviously leads to jitter
  • Frequency scaling: thankfully disabled in real-time systems

That's a lot of things that can interfere with the behavior of a piece of code. Nonetheless, if I have two instructions located on the same cache line, not depending on any data and not containing (conditional) jumps, then the jitter from caching and branch prediction should be eliminated, and only interrupts should play a role. Right? Well, I wrote a small program that reads the Time Stamp Counter (TSC) twice and writes the difference to stdout, and executed it on an RT-patched Linux kernel with frequency scaling disabled.
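In essence the program does nothing more than the following (a minimal sketch reconstructed from the description and the disassembly below, not necessarily the exact source; the rdtsc() helper is mine):

#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    /* rdtsc puts the low 32 bits of the TSC in eax and the high 32 bits in edx */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t t0 = rdtsc();
    uint64_t t1 = rdtsc();
    printf("%llu\n", (unsigned long long)(t1 - t0));
    return 0;
}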

The code has glibc-based init and cleanup, and calls printf, which I presume is sometimes in cache and sometimes isn't. But between the two "rdtsc" instructions (each writing the TSC into edx:eax), everything should be deterministic across executions of the binary. Just to be sure, I disassembled the ELF file; here is the part with the two rdtsc instructions:

00000000000006b0 <main>:
 6b0:   0f 31                   rdtsc  
 6b2:   48 c1 e2 20             shl    $0x20,%rdx
 6b6:   48 09 d0                or     %rdx,%rax
 6b9:   48 89 c6                mov    %rax,%rsi
 6bc:   0f 31                   rdtsc  
 6be:   48 c1 e2 20             shl    $0x20,%rdx
 6c2:   48 09 d0                or     %rdx,%rax
 6c5:   48 29 c6                sub    %rax,%rsi
 6c8:   e8 01 00 00 00          callq  6ce <print_rsi>
 6cd:   c3                      retq 

No conditional jumps, located on the same cache line (although I am not 100% sure about that - where exactly does the ELF loader put the instructions? Do 64-byte boundaries in the file map to 64-byte boundaries in memory?)... so where is the jitter coming from? If I execute that code 1000 times (via zsh, restarting the program each time), I get values from 12 to 46, with several different values in between. As frequency scaling is disabled in my kernel, that leaves interrupts. Now, I am willing to believe that, out of 1000 executions, one is interrupted. I am not prepared to believe that 90% are interrupted (we are talking about ns intervals here! Where would the interrupts come from?!).

So my questions are these:

  • why is the code not deterministic, i.e., why don't I get the same number with every run?
  • is it possible to reason about the running time, at least of this very simple piece of code, at all? Is there at least a bound on the running time that I can guarantee (using engineering principles, not measurement combined with hope)?
  • if there isn't, what exactly is the source of the non-deterministic behavior? What component of the CPU (or the rest of the computer?) is rolling the dice here?
Asked Apr 30 '17 by D. Veloper

2 Answers

Once you remove the external sources of jitter, CPUs are still not entirely deterministic - at least not with respect to the factors you can control.

More to the point, you seem to be operating under a model where each instruction executes serially, taking a certain amount of time. Of course, modern out-of-order CPUs are generally executing more than one instruction at once, and in general may reorder the instruction stream such that instructions are executing 200 or more instructions ahead of the oldest unexecuted instruction.

In that model, it is hard to say exactly where an instruction begins or ends (is it when it is decoded, executed, retired, or something else?), and it is certainly tough for "timing" instructions to have a reasonable cycle-accurate interpretation while participating in this highly parallel pipeline.

Since rdtsc doesn't serialize the pipeline, the timings you get can vary quite a bit even if the process is totally deterministic - they depend entirely on the other instructions in flight, and so on. The second call to rdtsc is never going to see the same pipeline state as the first, and the initial pipeline state is going to differ from run to run as well.

The usual solution here is to issue a cpuid instruction (which is serializing) prior to issuing rdtsc, although some refinements of that pattern have been discussed.
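For concreteness, here is a sketch of what that usually looks like (it follows the widely used cpuid/rdtsc ... rdtscp/cpuid fencing sequence; treat it as a starting point rather than a drop-in implementation):

#include <stdint.h>
#include <stdio.h>

static inline uint64_t tsc_begin(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__(
        "cpuid\n\t"            /* serializing: earlier instructions retire first */
        "rdtsc\n\t"            /* TSC -> edx:eax                                  */
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        : "=r"(hi), "=r"(lo)
        :
        : "%rax", "%rbx", "%rcx", "%rdx");
    return ((uint64_t)hi << 32) | lo;
}

static inline uint64_t tsc_end(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__(
        "rdtscp\n\t"           /* reads TSC after earlier instructions complete   */
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        "cpuid\n\t"            /* fence: keep later instructions out of the region */
        : "=r"(hi), "=r"(lo)
        :
        : "%rax", "%rbx", "%rcx", "%rdx");
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t t0 = tsc_begin();
    /* code under test goes here */
    uint64_t t1 = tsc_end();
    printf("%llu cycles\n", (unsigned long long)(t1 - t0));
    return 0;
}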

If you want a good model of how a piece of CPU-bound code operates1, you can get most of the way there by reading the first three guides on Agner Fog's optimization page (skip the C++ one if you are only interested in the assembly level), as well as What Every Programmer Should Know About Memory. There is a PDF version of the latter that might be easier to read.

That will allow you to take a piece of code and model how it will perform without ever running it. I've done it and have sometimes gotten cycle-accurate results for my effort. In other cases, the results are slower than the model would predict and you have to dig around to understand what other bottleneck you are hitting - and occasionally you will discover something entirely undocumented about an architecture!

If you just want cycle-accurate (or nearly so) timings for short segments of code, I recommend libpfc, which on x86 gives you userland access to the performance counters and claims cycle-accurate results under the right conditions (basically, you have to pin the process to a CPU and prevent context switches, which it seems you are likely already doing). The performance counters can give you better results than rdtsc.

Finally, note that rdtsc measures wall-clock time, which is fundamentally different from CPU cycles on nearly all modern cores with DVFS. As the CPU slows down, your apparent measured cost increases, and vice versa. This also adds some slowdown to the instruction itself, which has to go out and read a counter tied to a clock domain different from the CPU clock.


1 That is, one which is bound by computation, memory access and so on - and not by IO, user input, external devices, etc.

Answered Oct 28 '22 by BeeOnRope


The loader has placed the instructions at the addresses that you see on the left. I do not know whether the cache works on physical or logical addresses, but that's irrelevant, because the granularity of the mapping between physical and logical addresses is quite coarse (at least 4K, if I am not mistaken), and in any case it is certain to be a multiple of the size of a cache line. So you probably have one cache line boundary at address 0x680 and the next one at 0x6C0, so you are most probably fine with respect to cache lines.
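For concreteness, here is a tiny sketch of that arithmetic (assuming 64-byte cache lines, the line an address falls on is just the address with its low 6 bits cleared):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Addresses of interest from the disassembly above:
     * first rdtsc, second rdtsc, the or after it, and the callq. */
    uint64_t addrs[] = { 0x6b0, 0x6bc, 0x6c2, 0x6c8 };
    for (size_t i = 0; i < sizeof addrs / sizeof addrs[0]; i++)
        printf("0x%03llx is on the cache line starting at 0x%03llx\n",
               (unsigned long long)addrs[i],
               (unsigned long long)(addrs[i] & ~63ULL));
    return 0;
}

With those addresses, both rdtsc instructions land on the line starting at 0x680, while the tail of the snippet spills onto the line at 0x6C0.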

If your code had been pre-empted by an interrupt then one of your readings would have probably been off by hundreds, possibly thousands of cycles, not by tens of cycles as you witnessed. So that's not it, either.

Besides the factors that you have identified, there are many more that can influence the readings:

  • DMA access being done on behalf of another thread
  • The state of the CPU pipeline
  • CPU register allocation

CPU register allocation is of particular interest, because it gives an idea of how complicated modern CPUs are, and therefore how difficult it is to predict how much time is going to be taken by any given instruction. The registers that you are using are not real registers; they are "virtual" in a way (this is known as register renaming). The CPU contains an internal bank of general-purpose registers, and it assigns some of them to your thread, mapping them to whatever you want to think of as "rax" or "rdx". The complexity of this is mind-boggling.

At the end of the day, what you are discovering is that CPU timing is (not really, but) practically non-deterministic on modern x86-64 desktop systems. That's to be expected.

Luckily these systems are so fast that it hardly ever matters, and when it does matter, we do not use a desktop system, we use an embedded system.

And for those who have an academic need for predictable instruction execution time, there are emulators, which sum the number of clock cycles taken, according to the book, by each emulated instruction. Those are absolutely deterministic.
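A toy illustration of that idea (the opcodes and cycle costs below are made up, not taken from any real manual): the emulated time is just a sum over the instruction stream, so it is the same on every run.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical opcodes and per-opcode cycle costs, for illustration only. */
enum op { OP_MOV, OP_ADD, OP_MUL, OP_JMP };

static const uint32_t cost[] = { [OP_MOV] = 1, [OP_ADD] = 1, [OP_MUL] = 3, [OP_JMP] = 2 };

int main(void)
{
    enum op program[] = { OP_MOV, OP_ADD, OP_MUL, OP_JMP, OP_MOV };
    uint64_t cycles = 0;

    /* The emulated "time" is a pure function of the instruction stream:
     * run this as often as you like and you get the same count every time. */
    for (size_t i = 0; i < sizeof program / sizeof program[0]; i++)
        cycles += cost[program[i]];

    printf("%llu cycles\n", (unsigned long long)cycles);
    return 0;
}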

Answered Oct 27 '22 by Mike Nakis