I'm confused about whether rdtscp increments monotonically in a multi-core environment. According to the documentation for __rdtscp, rdtscp seems to be a per-processor instruction that can prevent reordering of instructions around the call.
The processor monotonically increments the time-stamp counter MSR every clock cycle and resets it to 0 whenever the processor is reset.
rdtscp definitely increments monotonically on the same CPU core, but is this rdtscp timestamp guaranteed to be monotonic across different CPU cores? I believe there is no such absolute guarantee. For example,
Thread on CPU core#0              Thread on CPU core#1
unsigned int ui;
uint64_t t11 = __rdtscp(&ui);
uint64_t t12 = __rdtscp(&ui);
uint64_t t13 = __rdtscp(&ui);
                                  unsigned int ui;
                                  uint64_t t21 = __rdtscp(&ui);
                                  uint64_t t22 = __rdtscp(&ui);
                                  uint64_t t23 = __rdtscp(&ui);
By my understanding, we can conclude decisively that t13 > t12 > t11, but we cannot guarantee t21 > t13.
I want to write a script to test if my understanding is correct or not, but I don't know how to construct an example to validate my hypothesis.
// file name: rdtscptest.cpp
// g++ rdtscptest.cpp -g -pthread -Wall -O0 -o run
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>
#include <pthread.h>   // pthread_setaffinity_np
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
#include <x86intrin.h> // __rdtscp

void test(int tid) {
    // Pin this thread to core `tid` before reading the TSC. Setting the
    // affinity from main() after the thread has started would race with
    // the read below (thread 0 does not sleep at all).
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(tid, &cpuset);
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    if (rc != 0) {
        std::cout << "Error calling pthread_setaffinity_np, code: " << rc << "\n";
    }
    // Stagger the reads: thread 0 reads first, then thread 1, then thread 2.
    std::this_thread::sleep_for(std::chrono::seconds(tid));
    unsigned int ui; // receives the processor ID (IA32_TSC_AUX)
    uint64_t tid_unique_ = __rdtscp(&ui);
    std::cout << "tid: " << tid << ", counter: " << tid_unique_ << ", ui: " << ui << std::endl;
    std::this_thread::sleep_for(std::chrono::seconds(1));
}

int main() {
    size_t trd_cnt = 3;
    std::vector<std::thread> threads(trd_cnt);
    for (size_t i = 0; i < trd_cnt; i++) {
        // three threads with tid: 0, 1, 2
        // each thread pins itself to a different CPU core
        threads[i] = std::thread(test, static_cast<int>(i));
    }
    for (size_t i = 0; i < trd_cnt; i++) {
        threads[i].join();
    }
    return 0;
}
So, two questions here:
Update, according to comments:
__rdtscp will (always?) increment across cores on advanced CPUs.
On most systems yes, if you create synchronization between threads to make sure that one actually does run after the other [1]. Otherwise all bets are off; starting one thread before another does not ensure that its code executes first.
Footnote 1: e.g. having one thread spin-wait until it sees an atomic store done by the other. Or use a mutex and run rdtscp in a critical section, along with a variable to record whether the other thread was already there.
On anything non-ancient (like Core2 and newer at least), the TSC ticks at a constant frequency (the "reference" frequency). See this answer for links and details about the constant_tsc / nonstop_tsc CPU features, and the possibility of the TSC not being synced.
Most modern systems in practice do have the TSC synced between cores, I think, thanks to motherboard vendors making sure that even on multi-socket systems the RESET signal is distributed to all cores at the same time, and to firmware and OS software taking care not to screw it up. It's much easier on a single-socket system like a normal desktop with a multicore CPU, where all the "extra" cores are on the same chip.
But this is not guaranteed, and part of why rdtscp exists (with a processor ID output) is this possibility (which I think might have been more common on older systems when RDTSCP was new).
There are even CPU features VMs can use to offset and scale the TSC transparently (with HW support), to migrate VMs between physical machines while preserving monotonicity and frequency of the TSC. Using these features indiscriminately can of course produce desynced TSCs or even ones that run at different frequencies on different cores.
The TSC is a 64-bit counter that usually counts at the CPU's rated sticker frequency. This can be over ~4.3 GHz (2^32 Hz) on some CPUs, so that leaves the high half incrementing about once per second on fast CPUs. The TSC can in theory wrap if the computer has been "up" for over 2^32 seconds (about 136 years at that rate), or if the TSC has been manually set to have a big offset.