How to count clock cycles with RDTSC in GCC x86? [duplicate]

Tags:

With Visual Studio I can read the clock cycle count from the processor as shown below. How do I do the same thing with GCC?

#ifdef _MSC_VER             // Compiler: Microsoft Visual Studio

    #ifdef _M_IX86                      // Processor: x86

        inline uint64_t clockCycleCount()
        {
            uint64_t c;
            __asm {
                cpuid       // serialize processor
                rdtsc       // read time stamp counter
                mov dword ptr [c + 0], eax
                mov dword ptr [c + 4], edx
            }
            return c;
        }

    #elif defined(_M_X64)               // Processor: x64

        extern "C" unsigned __int64 __rdtsc();
        #pragma intrinsic(__rdtsc)
        inline uint64_t clockCycleCount()
        {
            return __rdtsc();
        }

    #endif

#endif

520

asked Mar 27 '12 10:03

4 Answers

The other answers work, but you can avoid inline assembly by using GCC's __rdtsc intrinsic, available by including x86intrin.h.

It is defined at: gcc/config/i386/ia32intrin.h:

/* rdtsc */
extern __inline unsigned long long
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
__rdtsc (void)
{
  return __builtin_ia32_rdtsc ();
}

100

answered Oct 09 '22 23:10

Update: reposted and updated this answer on a more canonical question. I'll probably delete this at some point once we sort out which question to use as the duplicate target for closing all the similar rdtsc questions.

You don't need and shouldn't use inline asm for this. There's no benefit; compilers have built-ins for rdtsc and rdtscp, and (at least these days) all define a __rdtsc intrinsic if you include the right headers. https://gcc.gnu.org/wiki/DontUseInlineAsm

Unfortunately MSVC disagrees with everyone else about which header to use for non-SIMD intrinsics. (Intel's intriniscs guide says #include <immintrin.h> for this, but with gcc and clang the non-SIMD intrinsics are mostly in x86intrin.h.)

#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
unsigned long long readTSC() {
    // _mm_lfence();  // optionally wait for earlier insns to retire before reading the clock
    return __rdtsc();
    // _mm_lfence();  // optionally block later instructions until rdtsc retires
}

Compiles with all 4 of the major compilers: gcc/clang/ICC/MSVC, for 32 or 64-bit. See the results on the Godbolt compiler explorer.

For more about using lfence to improve repeatability of rdtsc, see @HadiBrais' answer on clflush to invalidate cache line via C function.

See also Is LFENCE serializing on AMD processors? (TL:DR yes with Spectre mitigation enabled, otherwise kernels leave the relevant MSR unset.)

`rdtsc` counts reference cycles, not CPU core clock cycles

It counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtsc is exactly correlated with wall-clock time (except for system clock adjustments, so it's basically steady_clock). It ticks at the CPU's rated frequency, i.e. the advertised sticker frequency.

If you use it for microbenchmarking, include a warm-up period first to make sure your CPU is already at max clock speed before you start timing. Or better, use a library that gives you access to hardware performance counters, or a trick like perf stat for part of program if your timed region is long enough that you can attach a perf stat -p PID. You usually will still want to avoid CPU frequency shifts during your microbenchmark, though.

std::chrono::clock, hardware clock and cycle count
Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?
Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC

It's also not guaranteed that the TSCs of all cores are in sync. So if your thread migrates to another CPU core between __rdtsc(), there can be an extra skew. (Most OSes attempt to sync the TSCs of all cores, though.) If you're using rdtsc directly, you probably want to pin your program or thread to a core, e.g. with taskset -c 0 ./myprogram on Linux.

How good is the asm from using the intrinsic?

It's at least as good as anything you could do with inline asm.

A non-inline version of it compiles MSVC for x86-64 like this:

unsigned __int64 readTSC(void) PROC                             ; readTSC
    rdtsc
    shl     rdx, 32                             ; 00000020H
    or      rax, rdx
    ret     0
  ; return in RAX

For 32-bit calling conventions that return 64-bit integers in edx:eax, it's just rdtsc/ret. Not that it matters, you always want this to inline.

In a test caller that uses it twice and subtracts to time an interval:

uint64_t time_something() {
    uint64_t start = readTSC();
    // even when empty, back-to-back __rdtsc() don't optimize away
    return readTSC() - start;
}

All 4 compilers make pretty similar code. This is GCC's 32-bit output:

# gcc8.2 -O3 -m32
time_something():
    push    ebx               # save a call-preserved reg: 32-bit only has 3 scratch regs
    rdtsc
    mov     ecx, eax
    mov     ebx, edx          # start in ebx:ecx
      # timed region (empty)

    rdtsc
    sub     eax, ecx
    sbb     edx, ebx          # edx:eax -= ebx:ecx

    pop     ebx
    ret                       # return value in edx:eax

This is MSVC's x86-64 output (with name-demangling applied). gcc/clang/ICC all emit identical code.

# MSVC 19  2017  -Ox
unsigned __int64 time_something(void) PROC                            ; time_something
    rdtsc
    shl     rdx, 32                  ; high <<= 32
    or      rax, rdx
    mov     rcx, rax                 ; missed optimization: lea rcx, [rdx+rax]
                                     ; rcx = start
     ;; timed region (empty)

    rdtsc
    shl     rdx, 32
    or      rax, rdx                 ; rax = end

    sub     rax, rcx                 ; end -= start
    ret     0
unsigned __int64 time_something(void) ENDP                            ; time_something

All 4 compilers use or+mov instead of lea to combine the low and high halves into a different register. I guess it's kind of a canned sequence that they fail to optimize.

But writing it in inline asm yourself is hardly better. You'd deprive the compiler of the opportunity to ignore the high 32 bits of the result in EDX, if you're timing such a short interval that you only keep a 32-bit result. Or if the compiler decides to store the start time to memory, it could just use two 32-bit stores instead of shift/or / mov. If 1 extra uop as part of your timing bothers you, you'd better write your whole microbenchmark in pure asm.

answered Oct 09 '22 23:10

Peter Cordes

On Linux with gcc, I use the following:

/* define this somewhere */
#ifdef __i386
__inline__ uint64_t rdtsc() {
  uint64_t x;
  __asm__ volatile ("rdtsc" : "=A" (x));
  return x;
}
#elif __amd64
__inline__ uint64_t rdtsc() {
  uint64_t a, d;
  __asm__ volatile ("rdtsc" : "=a" (a), "=d" (d));
  return (d<<32) | a;
}
#endif

/* now, in your function, do the following */
uint64_t t;
t = rdtsc();
// ... the stuff that you want to time ...
t = rdtsc() - t;
// t now contains the number of cycles elapsed

answered Oct 10 '22 00:10

a3nm

Related questions
                            
                                What does a leading :: mean in "using namespace ::X" in C++
                            
                                How to unset a specific bit in an integer
                            
                                Best way to safely printf to a string? [duplicate]
                            
                                std::forward_list and std::forward_list::push_back
                            
                                begin(), end() annoyance in STL algorithms
                            
                                Behaviour of arr[i] = i++ and i = i + 1 statements in C and C++
                            
                                How does =delete on destructor prevent stack allocation?
                            
                                Using new on void pointer
                            
                                Self numbers in c++
                            
                                System not declared in scope?
                            
                                unordered_multimap - iterating the result of find() yields elements with different value
                            
                                Confused with object arrays in C++
                            
                                Difference between the int * i and int** i
                            
                                C++ [Error] no matching function for call to
                            
                                To write a bootloader in C or C++?
                            
                                Why is a C++ bool var true by default?
                            
                                How to add all numbers in an array in C++?
                            
                                Multithreaded Rendering on OpenGL
                            
                                1e-9 or -1e9, which one is correct? [closed]
                            
                                i = ++i + ++i; in C++ [duplicate]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to count clock cycles with RDTSC in GCC x86? [duplicate]

Tags:

c++

c

x86

gcc

intel

rdtsc

Johan Råde

People also ask

4 Answers

Evan Shaw

Andrew Tomazos

`rdtsc` counts reference cycles, not CPU core clock cycles

How good is the asm from using the intrinsic?

Peter Cordes

a3nm

Recent Activity

Donate For Us

How to count clock cycles with RDTSC in GCC x86? [duplicate]

Tags:

c++

c

x86

gcc

intel

rdtsc

Johan Råde

People also ask

4 Answers

Evan Shaw

Andrew Tomazos

rdtsc counts reference cycles, not CPU core clock cycles

How good is the asm from using the intrinsic?

Peter Cordes

a3nm

Related questions

Recent Activity

Donate For Us

`rdtsc` counts reference cycles, not CPU core clock cycles