With Visual Studio I can read the clock cycle count from the processor as shown below. How do I do the same thing with GCC?
#ifdef _MSC_VER // Compiler: Microsoft Visual Studio
#ifdef _M_IX86 // Processor: x86
inline uint64_t clockCycleCount()
{
uint64_t c;
__asm {
cpuid // serialize processor
rdtsc // read time stamp counter
mov dword ptr [c + 0], eax
mov dword ptr [c + 4], edx
}
return c;
}
#elif defined(_M_X64) // Processor: x64
extern "C" unsigned __int64 __rdtsc();
#pragma intrinsic(__rdtsc)
inline uint64_t clockCycleCount()
{
return __rdtsc();
}
#endif
#endif
Without this, RDTSC does count core clock cycles. nonstop_tsc: This feature is called the invariant TSC in the Intel SDM manual and is supported on processors with CPUID.80000007H:EDX [8]. The TSC keeps ticking even in deep sleep C-states.
The RDTSC Performance Timer written in C++. The RDTSC is the IA-32/IA-64 (or x86/x64) instruction that loads the current value of the processors’ time stamp into EDX:EAX registers. RDTSC is short for “Read Time-Stamp Counter”. It returns the number of clock cycles since last reset.
It counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtsc is exactly correlated with wall-clock time (not counting system clock adjustments, so it's a perfect time source for steady_clock ).
Generates the rdtsc instruction, which returns the processor time stamp. The processor time stamp records the number of clock cycles since the last reset. A 64-bit unsigned integer representing a tick count. This routine is available only as an intrinsic.
The other answers work, but you can avoid inline assembly by using GCC's __rdtsc
intrinsic, available by including x86intrin.h
.
It is defined at: gcc/config/i386/ia32intrin.h
:
/* rdtsc */
extern __inline unsigned long long
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
__rdtsc (void)
{
return __builtin_ia32_rdtsc ();
}
On recent versions of Linux gettimeofday will incorporate nanosecond timings.
If you really want to call RDTSC you can use the following inline assembly:
http://www.mcs.anl.gov/~kazutomo/rdtsc.html
#if defined(__i386__)
static __inline__ unsigned long long rdtsc(void)
{
unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
return x;
}
#elif defined(__x86_64__)
static __inline__ unsigned long long rdtsc(void)
{
unsigned hi, lo;
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
#endif
Update: reposted and updated this answer on a more canonical question. I'll probably delete this at some point once we sort out which question to use as the duplicate target for closing all the similar rdtsc
questions.
You don't need and shouldn't use inline asm for this. There's no benefit; compilers have built-ins for rdtsc
and rdtscp
, and (at least these days) all define a __rdtsc
intrinsic if you include the right headers. https://gcc.gnu.org/wiki/DontUseInlineAsm
Unfortunately MSVC disagrees with everyone else about which header to use for non-SIMD intrinsics. (Intel's intriniscs guide says #include <immintrin.h>
for this, but with gcc and clang the non-SIMD intrinsics are mostly in x86intrin.h
.)
#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
unsigned long long readTSC() {
// _mm_lfence(); // optionally wait for earlier insns to retire before reading the clock
return __rdtsc();
// _mm_lfence(); // optionally block later instructions until rdtsc retires
}
Compiles with all 4 of the major compilers: gcc/clang/ICC/MSVC, for 32 or 64-bit. See the results on the Godbolt compiler explorer.
For more about using lfence
to improve repeatability of rdtsc
, see @HadiBrais' answer on clflush to invalidate cache line via C function.
See also Is LFENCE serializing on AMD processors? (TL:DR yes with Spectre mitigation enabled, otherwise kernels leave the relevant MSR unset.)
rdtsc
counts reference cycles, not CPU core clock cyclesIt counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtsc
is exactly correlated with wall-clock time (except for system clock adjustments, so it's basically steady_clock
). It ticks at the CPU's rated frequency, i.e. the advertised sticker frequency.
If you use it for microbenchmarking, include a warm-up period first to make sure your CPU is already at max clock speed before you start timing. Or better, use a library that gives you access to hardware performance counters, or a trick like perf stat for part of program if your timed region is long enough that you can attach a perf stat -p PID
. You usually will still want to avoid CPU frequency shifts during your microbenchmark, though.
It's also not guaranteed that the TSCs of all cores are in sync. So if your thread migrates to another CPU core between __rdtsc()
, there can be an extra skew. (Most OSes attempt to sync the TSCs of all cores, though.) If you're using rdtsc
directly, you probably want to pin your program or thread to a core, e.g. with taskset -c 0 ./myprogram
on Linux.
It's at least as good as anything you could do with inline asm.
A non-inline version of it compiles MSVC for x86-64 like this:
unsigned __int64 readTSC(void) PROC ; readTSC
rdtsc
shl rdx, 32 ; 00000020H
or rax, rdx
ret 0
; return in RAX
For 32-bit calling conventions that return 64-bit integers in edx:eax
, it's just rdtsc
/ret
. Not that it matters, you always want this to inline.
In a test caller that uses it twice and subtracts to time an interval:
uint64_t time_something() {
uint64_t start = readTSC();
// even when empty, back-to-back __rdtsc() don't optimize away
return readTSC() - start;
}
All 4 compilers make pretty similar code. This is GCC's 32-bit output:
# gcc8.2 -O3 -m32
time_something():
push ebx # save a call-preserved reg: 32-bit only has 3 scratch regs
rdtsc
mov ecx, eax
mov ebx, edx # start in ebx:ecx
# timed region (empty)
rdtsc
sub eax, ecx
sbb edx, ebx # edx:eax -= ebx:ecx
pop ebx
ret # return value in edx:eax
This is MSVC's x86-64 output (with name-demangling applied). gcc/clang/ICC all emit identical code.
# MSVC 19 2017 -Ox
unsigned __int64 time_something(void) PROC ; time_something
rdtsc
shl rdx, 32 ; high <<= 32
or rax, rdx
mov rcx, rax ; missed optimization: lea rcx, [rdx+rax]
; rcx = start
;; timed region (empty)
rdtsc
shl rdx, 32
or rax, rdx ; rax = end
sub rax, rcx ; end -= start
ret 0
unsigned __int64 time_something(void) ENDP ; time_something
All 4 compilers use or
+mov
instead of lea
to combine the low and high halves into a different register. I guess it's kind of a canned sequence that they fail to optimize.
But writing it in inline asm yourself is hardly better. You'd deprive the compiler of the opportunity to ignore the high 32 bits of the result in EDX, if you're timing such a short interval that you only keep a 32-bit result. Or if the compiler decides to store the start time to memory, it could just use two 32-bit stores instead of shift/or / mov. If 1 extra uop as part of your timing bothers you, you'd better write your whole microbenchmark in pure asm.
On Linux with gcc
, I use the following:
/* define this somewhere */
#ifdef __i386
__inline__ uint64_t rdtsc() {
uint64_t x;
__asm__ volatile ("rdtsc" : "=A" (x));
return x;
}
#elif __amd64
__inline__ uint64_t rdtsc() {
uint64_t a, d;
__asm__ volatile ("rdtsc" : "=a" (a), "=d" (d));
return (d<<32) | a;
}
#endif
/* now, in your function, do the following */
uint64_t t;
t = rdtsc();
// ... the stuff that you want to time ...
t = rdtsc() - t;
// t now contains the number of cycles elapsed
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With