
Does RDTSCP increment monotonically across multi-cores?

I'm confused about whether rdtscp increments monotonically in a multi-core environment. According to the documentation for __rdtscp, rdtscp seems to be a per-processor instruction, and it can prevent reordering of instructions around the call:

The processor monotonically increments the time-stamp counter MSR every clock cycle and resets it to 0 whenever the processor is reset.

rdtscp definitely increments monotonically on the same CPU core, but is this rdtscp timestamp guaranteed monotonic across different CPU cores? I believe there is no such absolute guarantee. For example,

Thread on CPU core#0                   Thread on CPU core#1

unsigned int ui;
uint64_t t11 = __rdtscp(&ui); 
uint64_t t12 = __rdtscp(&ui);  
uint64_t t13 = __rdtscp(&ui);         
                                       unsigned int ui;
                                       uint64_t t21 = __rdtscp(&ui);
                                       uint64_t t22 = __rdtscp(&ui);
                                       uint64_t t23 = __rdtscp(&ui);

By my understanding, we can conclude decisively that t13 > t12 > t11, but we cannot guarantee t21 > t13.

I want to write a program to test whether my understanding is correct, but I don't know how to construct an example that validates my hypothesis.

// file name: rdtscptest.cpp
// g++ rdtscptest.cpp -g -pthread -Wall -O0 -o run
#include <pthread.h>    // pthread_setaffinity_np, pthread_self
#include <sched.h>      // cpu_set_t, CPU_ZERO, CPU_SET
#include <x86intrin.h>  // __rdtscp

#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

void test(int tid) {
    // Pin the calling thread to core `tid` before reading the TSC, so the
    // read cannot happen before the affinity has taken effect.
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(tid, &cpuset);
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    if (rc != 0) {
        std::cout << "Error calling pthread_setaffinity_np, code: " << rc << "\n";
    }

    std::this_thread::sleep_for(std::chrono::seconds(tid));
    unsigned int ui;  // receives IA32_TSC_AUX (on Linux this typically encodes the CPU number)
    uint64_t counter = __rdtscp(&ui);
    std::cout << "tid: " << tid << ", counter: " << counter << ", ui: " << ui << std::endl;
    std::this_thread::sleep_for(std::chrono::seconds(1));
}

int main() {
    size_t trd_cnt = 3;
    std::vector<std::thread> threads(trd_cnt);

    for (size_t i = 0; i < trd_cnt; i++) {
        // three threads with tid: 0, 1, 2, each pinning itself to its own core
        threads[i] = std::thread(test, i);
    }

    for (size_t i = 0; i < trd_cnt; i++) {
        threads[i].join();
    }

    return 0;
}

So, two questions here:

  1. Is my understanding correct or not?
  2. How can I construct an example to validate it?

Update, following the comments:

__rdtscp will (always?) increase monotonically across cores on recent CPUs

asked Mar 02 '23 18:03 by stickers


1 Answer

On most systems, yes, provided you create synchronization between the threads to make sure that one actually does run after the other (see footnote 1). Otherwise all bets are off: starting one thread before another does not ensure that its code executes first.

Footnote 1: e.g. having one thread spin-wait until it sees an atomic store done by the other. Or use a mutex and run rdtscp inside the critical section, along with a variable that records whether the other thread has already been there.
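The spin-wait idea from footnote 1 could be sketched roughly like this. It is only an illustration under some assumptions: GCC or Clang on x86-64, and the names check_tsc_order and TscOrderResult are made up for this example. One thread reads the TSC and then publishes it with a release store; the other spin-waits on that store with acquire loads and only then reads its own TSC, so the second read really does happen after the first. The threads are not pinned here, so cross_cpu counts how many handoffs happened to land on different CPUs according to the rdtscp auxiliary value; pinning each thread with pthread_setaffinity_np, as in the question, would force that.

```cpp
#include <x86intrin.h>  // __rdtscp, _mm_pause (GCC/Clang on x86)

#include <atomic>
#include <cstdint>
#include <thread>

// Hypothetical result type for this sketch.
struct TscOrderResult {
    uint64_t rounds = 0;      // handoffs checked
    uint64_t cross_cpu = 0;   // handoffs where the two reads reported different aux values
    uint64_t violations = 0;  // handoffs where the later read was not strictly larger
};

// Runs `rounds` writer -> reader handoffs and compares the two TSC reads.
TscOrderResult check_tsc_order(uint64_t rounds) {
    std::atomic<uint64_t> published{0};  // last round whose writer timestamp is ready
    std::atomic<uint64_t> acked{0};      // last round the reader has finished
    uint64_t writer_tsc = 0;             // plain data, ordered by the release/acquire pair
    unsigned writer_aux = 0;
    TscOrderResult result;

    std::thread writer([&] {
        for (uint64_t r = 1; r <= rounds; ++r) {
            // Wait until the reader is done with the previous round.
            while (acked.load(std::memory_order_acquire) != r - 1) _mm_pause();
            writer_tsc = __rdtscp(&writer_aux);
            published.store(r, std::memory_order_release);
        }
    });

    // The calling thread acts as the reader.
    for (uint64_t r = 1; r <= rounds; ++r) {
        // Spin until the writer's store for this round is visible.
        while (published.load(std::memory_order_acquire) != r) _mm_pause();
        unsigned reader_aux = 0;
        uint64_t reader_tsc = __rdtscp(&reader_aux);

        ++result.rounds;
        if (reader_aux != writer_aux) ++result.cross_cpu;
        if (reader_tsc <= writer_tsc) ++result.violations;

        acked.store(r, std::memory_order_release);
    }

    writer.join();
    return result;
}
```

Calling check_tsc_order(1000000) from main and printing the three counters gives a rough check. A nonzero violations count would point to TSCs that are not synced between those cores; zero violations only means none were observed on this run, not a guarantee.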


On anything non-ancient (like Core2 and newer at least), the TSC ticks at a constant frequency (the "reference" frequency). See this answer for links and details about the constant_tsc / nonstop_tsc CPU features, and the possibility of the TSC not being synced.
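If you want to check the relevant feature bits from code rather than from /proc/cpuinfo, something like the following should work. It is a sketch assuming GCC or Clang on x86, using __get_cpuid from <cpuid.h>; the function names are made up for this example. It queries the "invariant TSC" bit (CPUID leaf 0x80000007, EDX bit 8) and the RDTSCP support bit (leaf 0x80000001, EDX bit 27). Note that the invariant-TSC bit only says the counter runs at a constant rate across power states; it says nothing about whether the TSCs of different cores are synced.

```cpp
#include <cpuid.h>  // __get_cpuid (GCC/Clang, x86 only)

// True if CPUID reports the RDTSCP instruction: leaf 0x80000001, EDX bit 27.
bool cpu_has_rdtscp() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    // __get_cpuid returns 0 if the requested leaf is not supported.
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx)) return false;
    return ((edx >> 27) & 1u) != 0;
}

// True if CPUID reports an invariant TSC: leaf 0x80000007, EDX bit 8.
bool cpu_has_invariant_tsc() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(0x80000007u, &eax, &ebx, &ecx, &edx)) return false;
    return ((edx >> 8) & 1u) != 0;
}
```

Inside a VM the hypervisor decides what these bits report, so treat the result as a hint rather than proof.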

In practice, I think most modern systems do have the TSC synced between cores, thanks to motherboard vendors making sure that even on multi-socket systems the RESET signal is distributed to all cores at the same time, and to firmware and OS software taking care not to screw it up. It's much easier on a single-socket system like a normal desktop with a multicore CPU, where all the "extra" cores are on the same chip.

But this is not guaranteed, and that possibility is part of why rdtscp exists (with a processor ID output). I think unsynced TSCs might have been more common on older systems, when RDTSCP was new.
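One way to use that processor ID output is to keep it next to each timestamp and refuse to subtract two timestamps that came from different CPUs. This is a minimal sketch assuming GCC or Clang on x86; TscSample, read_tsc and elapsed_ticks_same_cpu are names invented for the example. What the auxiliary value contains is up to the OS (it comes from the IA32_TSC_AUX MSR; Linux puts the CPU number in it), so the code only compares it for equality.

```cpp
#include <x86intrin.h>  // __rdtscp (GCC/Clang on x86)

#include <cstdint>

// A timestamp together with the auxiliary value rdtscp returned with it.
struct TscSample {
    uint64_t ticks;
    unsigned aux;  // IA32_TSC_AUX as set by the OS
};

// Reads the TSC and the auxiliary value in one instruction.
inline TscSample read_tsc() {
    unsigned aux = 0;
    uint64_t ticks = __rdtscp(&aux);
    return {ticks, aux};
}

// Stores end - start in `elapsed` and returns true only if both samples were
// taken on the same CPU and the counter did not appear to go backwards.
inline bool elapsed_ticks_same_cpu(const TscSample& start, const TscSample& end,
                                   uint64_t& elapsed) {
    if (start.aux != end.aux || end.ticks < start.ticks) return false;
    elapsed = end.ticks - start.ticks;
    return true;
}
```

A caller would take one sample before and one after the code being timed, and retry the measurement if the function returns false, for example because the thread was migrated in between.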

There are even CPU features VMs can use to offset and scale the TSC transparently (with HW support), to migrate VMs between physical machines while preserving monotonicity and frequency of the TSC. Using these features indiscriminately can of course produce desynced TSCs or even ones that run at different frequencies on different cores.


The TSC is a 64-bit counter that usually counts at the CPU's rated sticker frequency. This can be over ~4.2 GHz on some CPUs, close to 2^32 Hz (about 4.29 GHz), so the high half increments about once per second on fast CPUs. The TSC can in theory wrap if the computer has been "up" for around 2^32 seconds (roughly 136 years at that frequency), or if the TSC has been manually set to have a big offset.
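The arithmetic behind those numbers is easy to redo for any TSC frequency. A small sketch, with function names made up for this example and a year taken as 365.25 days:

```cpp
// Seconds between increments of the high 32 bits of a counter ticking at `hz`.
double high_half_period_seconds(double hz) {
    const double two_pow_32 = 4294967296.0;
    return two_pow_32 / hz;
}

// Years until a 64-bit counter that started at 0 and ticks at `hz` wraps.
double tsc_wrap_years(double hz) {
    const double two_pow_64 = 18446744073709551616.0;
    const double seconds_per_year = 365.25 * 24 * 3600;
    return two_pow_64 / hz / seconds_per_year;
}
```

At exactly 2^32 Hz the high half ticks once per second and the wrap time comes out at roughly 136 years; a slower TSC takes proportionally longer.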

answered Mar 27 '23 02:03 by Peter Cordes