I'm doing some Linux Kernel timings, specifically in the Interrupt Handling path. I've been using RDTSC for timings, however I recently learned it's not necessarily accurate as the instructions could be happening out of order.
I then tried:
RDTSC + CPUID (in reverse order, here) to flush the pipeline, and incurred up to a 60x overhead (!) on a Virtual Machine (my working environment) due to hypercalls and whatnot. This is both with and without HW Virtualization enabled.
Most recently I've come across the RDTSCP* instruction, which seems to do what RDTSC+CPUID did, but more efficiently as it's a newer instruction - only a 1.5x-2x overhead, relatively.
My question: Is RDTSCP truly accurate as a point of measurement, and is it the "correct" way of doing the timing?
Also to be more clear, my timing is essentially like this, internally:
*http://www.intel.de/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf page 27
A full discussion of the overhead you're seeing from the cpuid instruction is available at this stackoverflow thread. When using rdtsc, you need to use cpuid to ensure that no additional instructions are in the execution pipeline. The rdtscp instruction waits for all earlier instructions to execute before it reads the counter, so it needs no preceding cpuid. It is not a fully serializing instruction, though: later instructions may begin executing before the read is performed. (The referenced SO thread also discusses these salient points, but I addressed them here because they're part of your question as well).
You only "need" to use cpuid+rdtsc if your processor does not support rdtscp. Otherwise, rdtscp is what you want, and will accurately give you the information you are after.
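Both instructions provide you with a 64-bit, monotonically increasing counter. On older processors it counts core clock cycles; on processors with an invariant TSC it ticks at a constant rate regardless of the current core frequency. If this is your pattern: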
uint64_t s, e;
s = rdtscp();
do_interrupt();
e = rdtscp();
atomic_add(e - s, &acc);
atomic_add(1, &counter);
You may still have an off-by-one in your average measurement depending on where your read happens. For instance:
    T1                               T2
t0  atomic_add(e - s, &acc);
t1                                   a = atomic_read(&acc);
t2                                   c = atomic_read(&counter);
t3  atomic_add(1, &counter);
t4                                   avg = a / c;
It's unclear whether "[a]t the end" references a time that could race in this fashion. If so, you may want to calculate a running average or a moving average in-line with your delta.
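One way to keep the average in step with each delta is an incremental (running) mean, sketched below. This is a plain user-space C illustration, not kernel code: the `running_mean` type and `running_mean_add` function are hypothetical names, and the sketch assumes a single writer (for example one instance per CPU), so a reader only ever loads one word and cannot see `acc` and `counter` out of step. The division truncates, so the result is approximate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: running mean of cycle deltas.
 * Assumption: updated from a single context only (e.g. one per CPU). */
struct running_mean {
    uint64_t count; /* number of samples folded in so far */
    int64_t  mean;  /* current mean in TSC ticks (integer, truncating) */
};

/* Fold one delta into the mean using mean += (x - mean) / n. */
static void running_mean_add(struct running_mean *m, uint64_t delta)
{
    assert(m != NULL);
    m->count++;
    m->mean += ((int64_t)delta - m->mean) / (int64_t)m->count;
}
```

In the measurement pattern above, `running_mean_add(&stats, e - s)` would take the place of the two `atomic_add` calls, and a reader would load only `stats.mean`.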
Side-points:
for (int i = 0; i < SOME_LARGEISH_NUMBER; i++) {
s = rdtscp();
loop_body();
e = rdtscp();
acc += e - s;
}
printf("%"PRIu64"\n", (acc / SOME_LARGEISH_NUMBER / CLOCK_SPEED));
While this will give you a decent idea of the overall performance in cycles of whatever is in loop_body(), it defeats processor optimizations such as pipelining. In microbenchmarks, the processor will do a pretty good job of branch prediction in the loop, so measuring the loop overhead is fine. Doing it the way shown above is also bad because you end up with 2 pipeline stalls per loop iteration. Thus:
s = rdtscp();
for (int i = 0; i < SOME_LARGEISH_NUMBER; i++) {
loop_body();
}
e = rdtscp();
printf("%"PRIu64"\n", ((e-s) / SOME_LARGEISH_NUMBER / CLOCK_SPEED));
Will be more efficient and probably more accurate in terms of what you'll see in Real Life versus what the previous benchmark would tell you.
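For reference, here is a minimal compilable sketch of that second pattern. It assumes GCC or Clang on x86, where `<x86intrin.h>` provides the `__rdtscp` intrinsic. `loop_body`, `time_loop` and `ticks_per_iteration` are hypothetical names for illustration, and the stand-in loop body just bumps a volatile counter so the compiler cannot delete the loop.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h> /* __rdtscp(); GCC/Clang on x86 only */

/* Volatile sink so the stand-in workload is not optimized away. */
static volatile uint64_t sink;

/* Stand-in for the code under test (assumption: replace with real work). */
static void loop_body(void)
{
    sink += 1;
}

/* Average TSC ticks per iteration, truncating. */
static uint64_t ticks_per_iteration(uint64_t start, uint64_t end, uint64_t iters)
{
    assert(iters > 0 && end >= start);
    return (end - start) / iters;
}

/* Read the TSC once before and once after the whole loop. */
static uint64_t time_loop(uint64_t iters)
{
    unsigned int aux; /* receives IA32_TSC_AUX; unused here */
    uint64_t s = __rdtscp(&aux);
    for (uint64_t i = 0; i < iters; i++)
        loop_body();
    uint64_t e = __rdtscp(&aux);
    return ticks_per_iteration(s, e, iters);
}
```

Calling `time_loop(SOME_LARGEISH_NUMBER)` returns ticks per iteration; turning that into time still needs the TSC frequency, as with `CLOCK_SPEED` above.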
The 2010 Intel paper How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures can be considered outdated in its recommendation to combine RDTSC/RDTSCP with CPUID.
Current Intel reference documentation recommends fencing instructions as more efficient alternatives to CPUID:
Note that the SFENCE, LFENCE, and MFENCE instructions provide a more efficient method of controlling memory ordering than the CPUID instruction.
(Intel® 64 and IA-32 Architectures Software Developer’s Manual: Volume 3, Section 8.2.5, September 2016)
If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads and stores are globally visible, it can execute the sequence MFENCE;LFENCE immediately before RDTSC.
(Intel RDTSC)
Thus, to get the TSC start value you execute this instruction sequence:
mfence
lfence
rdtsc
shl rdx, 0x20
or rax, rdx
At the end of your benchmark, to get the TSC stop value:
rdtscp
lfence
shl rdx, 0x20
or rax, rdx
Note that, in contrast to CPUID, the lfence instruction doesn't clobber any registers, so it isn't necessary to save the EDX:EAX registers before executing the serializing instruction.
Relevant documentation snippet:
If software requires RDTSCP to be executed prior to execution of any subsequent instruction (including any memory accesses), it can execute LFENCE immediately after RDTSCP (Intel RDTSCP)
As an example of how to integrate this into a C program, see also my GCC inline assembler implementations of the above operations.
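A rough sketch of what such wrappers might look like follows. It is not the linked implementation: `tsc_start`, `tsc_stop` and `tsc_elapsed` are hypothetical names, and it assumes GCC or Clang extended inline assembly on an x86 processor that supports RDTSCP. The asm templates contain only operand-free mnemonics, so they assemble the same way in AT&T and Intel syntax.

```c
#include <assert.h>
#include <stdint.h>

/* Start value: MFENCE; LFENCE; RDTSC, as in the sequence above. */
static inline uint64_t tsc_start(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("mfence\n\t"
                         "lfence\n\t"
                         "rdtsc"
                         : "=a"(lo), "=d"(hi)
                         :
                         : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Stop value: RDTSCP; LFENCE. RDTSCP also writes ECX (IA32_TSC_AUX),
 * so ECX is declared as an output and then ignored. */
static inline uint64_t tsc_stop(void)
{
    uint32_t lo, hi, aux;
    __asm__ __volatile__("rdtscp\n\t"
                         "lfence"
                         : "=a"(lo), "=d"(hi), "=c"(aux)
                         :
                         : "memory");
    (void)aux;
    return ((uint64_t)hi << 32) | lo;
}

/* Elapsed ticks between a start and a stop reading. */
static inline uint64_t tsc_elapsed(uint64_t start, uint64_t stop)
{
    assert(stop >= start);
    return stop - start;
}
```

The `"memory"` clobber only stops the compiler from moving memory accesses across the asm statement; the ordering inside the CPU comes from the fence instructions themselves.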