I am trying to profile code for execution time on an x86-64 processor. I am referring to this Intel white paper and have also gone through other SO threads discussing RDTSCP vs CPUID+RDTSC here and here.
In the above-mentioned white paper, the CPUID+RDTSC method is termed unreliable, and this is backed up with statistics.
What might be the reason for CPUID+RDTSC being unreliable?
Also, the graphs in Figure 1 (Minimum value Behavior graph) and Figure 2 (Variance Behavior graph) in the same white paper show a "square wave" pattern. What explains such a pattern?
I think they're finding that CPUID inside the measurement interval causes extra variability in the total time. Their proposed fix in 3.2 Improvements Using RDTSCP Instruction highlights the fact that there's no CPUID inside the timed interval when they use CPUID/RDTSC to start and RDTSCP/CPUID to stop.
Perhaps they could have ensured EAX=0 or EAX=1 before executing CPUID, to choose which CPUID leaf of data to read (http://www.sandpile.org/x86/cpuid.htm#level_0000_0000h), in case the time CPUID takes depends on which query you make. Other than that, I'm not sure what would explain it.
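Roughly, that start/stop pattern looks like this in GNU C inline asm. This is a sketch, not the whitepaper's exact code: the function names are mine, EAX is pinned to leaf 0 as suggested above, and the output constraints already follow the note further down about avoiding the extra mov instructions.

```c
#include <stdint.h>

/* Start of the timed region: CPUID serializes, then RDTSC reads the TSC.
 * No CPUID executes inside the interval itself. */
static inline uint64_t tsc_start(void)
{
    uint32_t lo, hi;
    asm volatile("cpuid\n\t"              /* wait for all earlier instructions      */
                 "rdtsc"                  /* then sample the TSC into EDX:EAX       */
                 : "=a"(lo), "=d"(hi)
                 : "a"(0)                 /* pin the CPUID leaf (EAX=0)             */
                 : "rbx", "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* End of the timed region: RDTSCP waits for prior instructions, then a
 * separate CPUID keeps later code from being reordered into the interval. */
static inline uint64_t tsc_stop(void)
{
    uint32_t lo, hi;
    asm volatile("rdtscp"                 /* waits for earlier instructions to finish */
                 : "=a"(lo), "=d"(hi)
                 :
                 : "rcx", "memory");      /* RDTSCP also writes IA32_TSC_AUX to ECX   */
    asm volatile("cpuid"                  /* barrier against later code               */
                 :
                 :
                 : "rax", "rbx", "rcx", "rdx", "memory");
    return ((uint64_t)hi << 32) | lo;
}
```

The "memory" clobbers are there so the compiler doesn't move the code under test across the timing statements.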
Or better, use lfence instead of cpuid to serialize OoO exec without being a full serializing operation.
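A minimal sketch of the lfence variant (GNU C inline asm assumed; note that on AMD, lfence only blocks dispatch of later instructions when the relevant MSR bit is set, which current kernels enable as part of Spectre mitigations):

```c
#include <stdint.h>

/* Read the TSC with LFENCE on both sides: rdtsc can't execute before
 * earlier instructions complete, and later instructions can't start
 * before rdtsc has executed. No full serialization like CPUID. */
static inline uint64_t rdtsc_fenced(void)
{
    uint32_t lo, hi;
    asm volatile("lfence\n\t"
                 "rdtsc\n\t"
                 "lfence"
                 : "=a"(lo), "=d"(hi)
                 :
                 : "memory");     /* keep the compiler from reordering the measured code */
    return ((uint64_t)hi << 32) | lo;
}
```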
Note that the inline asm in Intel's whitepaper sucks: there's no need for those mov instructions if you use proper output constraints like "=a"(low), "=d"(high). See How to get the CPU cycle count in x86_64 from C++? for better ways.
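For example, with the __rdtsc / __rdtscp intrinsics you don't have to write any asm at all. Sketch for GCC/clang; keep in mind the intrinsics by themselves don't add the CPUID/lfence ordering barriers discussed above:

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* GCC/clang; MSVC has the same intrinsics in <intrin.h> */

int main(void)
{
    unsigned aux;
    uint64_t t0 = __rdtsc();          /* compiler emits rdtsc and joins EDX:EAX for you  */
    /* ... code under test ... */
    uint64_t t1 = __rdtscp(&aux);     /* rdtscp; aux gets IA32_TSC_AUX (core/socket id)  */
    printf("elapsed: %llu reference cycles\n", (unsigned long long)(t1 - t0));
    return 0;
}
```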