Accuracy of rdtsc for benchmarking and time stamp counter frequency

As part of a benchmarking task, I was investigating the different mechanisms that can be used to measure elapsed time. I've settled on using clock_gettime, but I did do sufficient research and testing with the RDTSC instruction as well. I have several questions about it (based on what I read in several online threads):

  • On newer processors (>Pentium 4), the TSC ticks at the maximum frequency of the CPU on the system. Is this correct? In that case is it valid to use the number of ticks and frequency to determine the time?

  • If the above is true, it means TSC is unaffected by changes in CPU frequency due to power saving and other features. Knowing this, would it mean the total ticks obtained by using RDTSC are NOT the actual ticks used by the sampled piece of code - since the code would run at the frequency of the CPU and not that of the TSC? In addition, does this mean the time obtained by using the TSC ticks and CPU frequency isn't the actual time used by the code piece?

  • I'm finding lots of different statements about the synchronization of the TSC value across cores (see this thread). I'm not sure what is correct, and I'm guessing it depends on the processor model as well. But can it be assumed to be synchronized among cores on newer CPUs (without using sched_setaffinity)?

Do note that I'm not using RDTSC due to the various problems associated with it (portability, reliability, etc.). These questions are just to improve my understanding of how the TSC works and of benchmarking in general.
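For context, the clock_gettime-based measurement I settled on looks roughly like this (a minimal sketch using CLOCK_MONOTONIC; the code-under-test placeholder is just illustrative):

    /* Minimal elapsed-time measurement with clock_gettime.
       Compile with: gcc -O2 timing.c   (add -lrt on older glibc) */
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);

        /* ... code under test ... */

        clock_gettime(CLOCK_MONOTONIC, &end);

        double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
                          + (end.tv_nsec - start.tv_nsec);
        printf("elapsed: %.0f ns\n", elapsed_ns);
        return 0;
    }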

asked Sep 14 '15 by Cygnus

1 Answer

Invariant TSC means, according to Intel,

The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states.

But what rate is that? Well,

That rate may be set by the maximum core-clock to bus-clock ratio of the processor or may be set by the maximum resolved frequency at which the processor is booted. The maximum resolved frequency may differ from the maximum qualified frequency of the processor, see Section 18.14.5 for more detail. On certain processors, the TSC frequency may not be the same as the frequency in the brand string.

Looks to me as though they wanted it to be the frequency from the brand string, but then somehow didn't always get it right. What is that frequency, though?

The TSC, IA32_MPERF, and IA32_FIXED_CTR2 operate at the same, maximum-resolved frequency of the platform, which is equal to the product of scalable bus frequency and maximum resolved bus ratio.

For processors based on Intel Core microarchitecture, the scalable bus frequency is encoded in the bit field MSR_FSB_FREQ[2:0] at (0CDH), see Appendix B, "Model-Specific Registers (MSRs)". The maximum resolved bus ratio can be read from the following bit field:

  • If XE operation is disabled, the maximum resolved bus ratio can be read in MSR_PLATFORM_ID[12:8]. It corresponds to the maximum qualified frequency.

  • If XE operation is enabled, the maximum resolved bus ratio is given in MSR_PERF_STAT[44:40]. It corresponds to the maximum XE operation frequency configured by BIOS.

That's probably not very helpful though. TL;DR: finding the TSC rate programmatically is too much effort. You can of course easily find it on your own system: get a rough estimate from a timed loop and take the "nearest number that makes sense". It's probably the number from the brand string anyway. It has been on all the systems I've tested, but I haven't tested that many. And if it isn't, it'll be some significantly different rate, so you'll definitely know.
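If you do want a quick estimate, something like this works (a rough sketch, assuming Linux with GCC/Clang on x86; it just compares RDTSC against clock_gettime over a fixed interval):

    /* Rough TSC-rate estimate: count TSC ticks across a known wall-clock
       interval measured with clock_gettime. A longer interval gives a
       better estimate; round the result to the "number that makes sense". */
    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>
    #include <x86intrin.h>   /* __rdtsc() */

    int main(void)
    {
        struct timespec t0, t1, delay = { 0, 100 * 1000 * 1000 };  /* 100 ms */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        uint64_t tsc0 = __rdtsc();

        nanosleep(&delay, NULL);      /* wait a known wall-clock interval */

        uint64_t tsc1 = __rdtsc();
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("estimated TSC rate: %.3f GHz\n", (tsc1 - tsc0) / secs / 1e9);
        return 0;
    }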

In addition, does this mean the time obtained by using the TSC ticks and CPU frequency isn't the actual time used by the code piece?

Yes, however not all hope is lost: the time obtained by using TSC ticks and the TSC rate (if you somehow know it) will give the actual time... almost. Here a lot of FUD about unreliability is usually spouted. Yes, RDTSC is not serializing (but you can add serializing instructions). RDTSCP is serializing, but in some ways not quite enough (it can't execute too early, but it can execute too late). But it's not as though you can't use them: you can either accept a small error, or read the paper I linked below.
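For illustration, the usual fencing pattern looks roughly like this (a sketch using GCC/Clang intrinsics on x86; the Intel paper linked below uses CPUID rather than LFENCE as the barrier):

    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_lfence */

    /* Read the TSC at the start of the measured region. */
    static inline uint64_t tsc_begin(void)
    {
        _mm_lfence();              /* let earlier instructions drain first        */
        uint64_t t = __rdtsc();
        _mm_lfence();              /* keep the measured code from starting early  */
        return t;
    }

    /* Read the TSC at the end of the measured region. */
    static inline uint64_t tsc_end(void)
    {
        unsigned aux;
        uint64_t t = __rdtscp(&aux);  /* waits for prior instructions to finish  */
        _mm_lfence();                 /* keep later instructions from moving up  */
        return t;
    }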

But can it be assumed to be synchronized among cores on newer CPUs?

Yes, no, maybe - it will be synchronized, unless the TSC is written to. Who knows, someone might do it. Out of your control. It also won't be synchronized across different sockets.

Finally, I don't really buy the FUD about RDTSC(P) in the context of benchmarking. You can serialize it as much as you need, the TSC is invariant, and you know the rate because it's your system. There isn't really any alternative either; it's basically the source of high-resolution time measurement that everything else ends up using in the end anyway. Even without special precautions (but with filtering of your data) the accuracy and precision are fine for most benchmarks, and if you need more, read How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures, where they write a kernel module to get rid of two other sources of benchmark error that are subject to much FUD: preemptions and interrupts.
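As a rough illustration of the "filter your data" point (not the paper's kernel-module method, just the common take-the-minimum-of-many-runs approach; plain __rdtsc() here, the fenced helpers above would slot in the same way):

    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>

    #define RUNS 1000

    static void code_under_test(void)
    {
        /* ... whatever is being benchmarked ... */
    }

    int main(void)
    {
        uint64_t best = UINT64_MAX;

        for (int i = 0; i < RUNS; i++) {
            uint64_t t0 = __rdtsc();
            code_under_test();
            uint64_t t1 = __rdtsc();
            if (t1 - t0 < best)
                best = t1 - t0;   /* keep the cheapest run; interrupts and
                                     preemption only ever make samples larger */
        }

        printf("minimum: %llu cycles over %d runs\n",
               (unsigned long long)best, RUNS);
        return 0;
    }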

answered Sep 25 '22 by harold