Section Old Question contains the initial question (Further Investigation and Conclusion have been added since).
Skip to the section Further Investigation below for a detailed comparison of the different timing methods (rdtsc
, clock_gettime
and QueryThreadCycleTime
).
I believe the erratic behaviour of CGT can be attributed to either a buggy kernel or a buggy CPU (see section Conclusion).
The code used for testing is at the bottom of this question (see section Appendix).
Apologies for the length.
In short: I am using clock_gettime
to measure the execution time of many code segments. I am experiencing very inconsistent measurements between separate runs. The method has an extremely high standard deviation when compared to other methods (see Explanation below).
Question: Is there a reason why clock_gettime
would give so inconsistent measurements when compared to other methods? Is there an alternative method with the same resolution that accounts for thread idle time?
Explanation: I am trying to profile a number of small parts of C code. The execution time of each of the code segments is not more than a couple of microseconds. In a single run, each of the code segments will execute some hundreds of times, which produces runs × hundreds
of measurements.
I also have to measure only the time the thread actually spends executing (which is why rdtsc
is not suitable). I also need a high resolution (which is why times
is not suitable).
I've tried the following methods:
rdtsc
(on Linux and Windows),
clock_gettime
(with 'CLOCK_THREAD_CPUTIME_ID'; on Linux), and
QueryThreadCycleTime
(on Windows).
Methodology: The analysis was performed on 25 runs. In each run, separate code segments repeat a 101 of times. Therefore I have 2525 measurements. Then I look at a histogram of the measurements, and also calculate some basic stuff (like the mean, std.dev., median, mode, min, and max).
I do not present how I measured the 'similarity' of the three methods, but this simply involved a basic comparison of proportion of times spent in each code segment ('proportion' means that the times are normalised). I then look at the pure differences in these proportions. This comparison showed that all 'rdtsc', 'QTCT', and 'CGT' measure the same proportions when averaged over the 25 runs. However, the results below show that 'CGT' has a very large standard deviation. This makes it unusable in my use case.
Results:
A comparison of clock_gettime
with rdtsc
for the same code segment (25 runs of 101 measurements = 2525 readings):
clock_gettime:
the rest is between 900 and 5000 ns.
Min: 11 ns
rdtsc (note: no context switches occurred during this run, but if it happens, it usually results in just a single measurement of 30000 ticks or so):
1 measurement of 1256 ticks.
Min: 274 ticks
Discussion:
rdtsc
gives very similar results on both Linux and Windows. It has an acceptable standard deviation--it is actually quite consistent/stable. However, it does not account for thread idle time. Therefore, context switches make the measurements erratic (on Windows I have observed this quite often: a code segment with an average of 1000 ticks or so will take ~30000 ticks every now and then--definitely because of pre-emption).
QueryThreadCycleTime
gives very consistent measurements--i.e. much lower standard deviation when compared to rdtsc
. When no context switches happen, this method is almost identical to rdtsc
.
clock_gettime
, on the other hand, is producing extremely inconsistent results (not just between runs, but also between measurements). The standard deviations are extreme (when compared to rdtsc
).
I hope the statistics are okay. But what could be the reason for such a discrepancy in the measurements between the two methods? Of course, there is caching, CPU/core migration, and other things. But none of this should be responsible for any such differences between 'rdtsc' and 'clock_gettime'. What is going on?
I have investigated this a bit further. I have done two things:
Measured the overhead of just calling clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t)
(see code 1 in Appendix), and
in a plain loop called clock_gettime
and stored the readings into an array (see code 2 in Appendix). I measure the delta times (difference in successive measurement times, which should correspond a bit to the overhead of the call of clock_gettime
).
I have measured it on two different computers with two different Linux Kernel versions:
CGT:
CPU: Core 2 Duo L9400 @ 1.86GHz
Kernel: Linux 2.6.40-4.fc15.i686 #1 SMP Fri Jul 29 18:54:39 UTC 2011 i686 i686 i386
Results:
clock_gettime
overhead: between 690-710 nsDelta times:
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
------------------+-----------
697 < x ≤ 800 -> 78111 <-- cached?
800 < x ≤ 1000 -> 16412
1000 < x ≤ 1500 -> 3
1500 < x ≤ 2000 -> 4836 <-- uncached?
2000 < x ≤ 3000 -> 305
3000 < x ≤ 5000 -> 161
5000 < x ≤ 10000 -> 105
10000 < x ≤ 15000 -> 53
15000 < x ≤ 20000 -> 8
20000 < x -> 5
CPU: 4 × Dual Core AMD Opteron Processor 275
Kernel: Linux 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux
Results:
clock_gettime
overhead: between 279-283 nsDelta times:
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
--------------------+-----------
x ≤ 1 -> 86738 <-- cached?
282 < x ≤ 300 -> 13118 <-- uncached?
300 < x ≤ 440 -> 78
2000 < x ≤ 5000 -> 52
5000 < x ≤ 30000 -> 5
3000000 < x -> 8
RDTSC:
Related code rdtsc_delta.c
and rdtsc_overhead.c
.
CPU: Core 2 Duo L9400 @ 1.86GHz
Kernel: Linux 2.6.40-4.fc15.i686 #1 SMP Fri Jul 29 18:54:39 UTC 2011 i686 i686 i386
Results:
Delta times:
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
------------------+-----------
34 < x ≤ 35 -> 16240 <-- cached?
41 < x ≤ 42 -> 63585 <-- uncached? (small difference)
48 < x ≤ 49 -> 19779 <-- uncached?
49 < x ≤ 120 -> 195
3125 < x ≤ 5000 -> 144
5000 < x ≤ 10000 -> 45
10000 < x ≤ 20000 -> 9
20000 < x -> 2
CPU: 4 × Dual Core AMD Opteron Processor 275
Kernel: Linux 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux
Results:
Delta times:
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
------------------+-----------
13 < x ≤ 14 -> 192
14 < x ≤ 21 -> 78172 <-- cached?
21 < x ≤ 50 -> 10818
50 < x ≤ 103 -> 10624 <-- uncached?
5825 < x ≤ 6500 -> 88
6500 < x ≤ 8000 -> 88
8000 < x ≤ 10000 -> 11
10000 < x ≤ 15000 -> 4
15000 < x ≤ 16372 -> 2
QTCT:
Related code qtct_delta.c
and qtct_overhead.c
.
CPU: Core 2 6700 @ 2.66GHz
Kernel: Windows 7 64-bit
Results:
Delta times:
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
------------------+-----------
879 < x ≤ 890 -> 71347 <-- cached?
895 < x ≤ 1469 -> 844
1469 < x ≤ 1600 -> 27613 <-- uncached?
1600 < x ≤ 2000 -> 55
2000 < x ≤ 4000 -> 86
4000 < x ≤ 8000 -> 43
8000 < x ≤ 16000 -> 10
16000 < x -> 1
I believe the answer to my question would be a buggy implementation on my machine (the one with AMD CPUs with an old Linux kernel).
The CGT results of the AMD machine with the old kernel show some extreme readings. If we look at the delta times, we'll see that the most frequent delta is 1 ns. This means that the call to clock_gettime
took less than a nanosecond! Moreover, it also produced a number of extraordinary large deltas (of more than 3000000 ns)! This seems to be erroneous behaviour. (Maybe unaccounted core migrations?)
Remarks:
The overhead of CGT and QTCT is quite big.
It is also difficult to account for their overhead, because CPU caching seems to make quite a big difference.
Maybe sticking to RDTSC, locking the process to one core, and assigning real-time priority is the most accurate way to tell how many cycles a piece of code used...
Code 1: clock_gettime_overhead.c
#include <time.h>
#include <stdio.h>
#include <stdint.h>
/* Compiled & executed with:
gcc clock_gettime_overhead.c -O0 -lrt -o clock_gettime_overhead
./clock_gettime_overhead 100000
*/
int main(int argc, char **args) {
struct timespec tstart, tend, dummy;
int n, N;
N = atoi(args[1]);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tstart);
for (n = 0; n < N; ++n) {
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
}
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tend);
printf("Estimated overhead: %lld ns\n",
((int64_t) tend.tv_sec * 1000000000 + (int64_t) tend.tv_nsec
- ((int64_t) tstart.tv_sec * 1000000000
+ (int64_t) tstart.tv_nsec)) / N / 10);
return 0;
}
Code 2: clock_gettime_delta.c
#include <time.h>
#include <stdio.h>
#include <stdint.h>
/* Compiled & executed with:
gcc clock_gettime_delta.c -O0 -lrt -o clock_gettime_delta
./clock_gettime_delta > results
*/
#define N 100000
int main(int argc, char **args) {
struct timespec sample, results[N];
int n;
for (n = 0; n < N; ++n) {
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &sample);
results[n] = sample;
}
printf("%s\t%s\n", "Absolute time", "Delta");
for (n = 1; n < N; ++n) {
printf("%lld\t%lld\n",
(int64_t) results[n].tv_sec * 1000000000 +
(int64_t)results[n].tv_nsec,
(int64_t) results[n].tv_sec * 1000000000 +
(int64_t) results[n].tv_nsec -
((int64_t) results[n-1].tv_sec * 1000000000 +
(int64_t)results[n-1].tv_nsec));
}
return 0;
}
Code 3: rdtsc.h
static uint64_t rdtsc() {
#if defined(__GNUC__)
# if defined(__i386__)
uint64_t x;
__asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
return x;
# elif defined(__x86_64__)
uint32_t hi, lo;
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ((uint64_t)lo) | ((uint64_t)hi << 32);
# else
# error Unsupported architecture.
# endif
#elif defined(_MSC_VER)
return __rdtsc();
#else
# error Other compilers not supported...
#endif
}
Code 4: rdtsc_delta.c
#include <stdio.h>
#include <stdint.h>
#include "rdtsc.h"
/* Compiled & executed with:
gcc rdtsc_delta.c -O0 -o rdtsc_delta
./rdtsc_delta > rdtsc_delta_results
Windows:
cl -Od rdtsc_delta.c
rdtsc_delta.exe > windows_rdtsc_delta_results
*/
#define N 100000
int main(int argc, char **args) {
uint64_t results[N];
int n;
for (n = 0; n < N; ++n) {
results[n] = rdtsc();
}
printf("%s\t%s\n", "Absolute time", "Delta");
for (n = 1; n < N; ++n) {
printf("%lld\t%lld\n", results[n], results[n] - results[n-1]);
}
return 0;
}
Code 5: rdtsc_overhead.c
#include <time.h>
#include <stdio.h>
#include <stdint.h>
#include "rdtsc.h"
/* Compiled & executed with:
gcc rdtsc_overhead.c -O0 -lrt -o rdtsc_overhead
./rdtsc_overhead 1000000 > rdtsc_overhead_results
Windows:
cl -Od rdtsc_overhead.c
rdtsc_overhead.exe 1000000 > windows_rdtsc_overhead_results
*/
int main(int argc, char **args) {
uint64_t tstart, tend, dummy;
int n, N;
N = atoi(args[1]);
tstart = rdtsc();
for (n = 0; n < N; ++n) {
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
}
tend = rdtsc();
printf("%G\n", (double)(tend - tstart)/N/10);
return 0;
}
Code 6: qtct_delta.c
#include <stdio.h>
#include <stdint.h>
#include <Windows.h>
/* Compiled & executed with:
cl -Od qtct_delta.c
qtct_delta.exe > windows_qtct_delta_results
*/
#define N 100000
int main(int argc, char **args) {
uint64_t ticks, results[N];
int n;
for (n = 0; n < N; ++n) {
QueryThreadCycleTime(GetCurrentThread(), &ticks);
results[n] = ticks;
}
printf("%s\t%s\n", "Absolute time", "Delta");
for (n = 1; n < N; ++n) {
printf("%lld\t%lld\n", results[n], results[n] - results[n-1]);
}
return 0;
}
Code 7: qtct_overhead.c
#include <stdio.h>
#include <stdint.h>
#include <Windows.h>
/* Compiled & executed with:
cl -Od qtct_overhead.c
qtct_overhead.exe 1000000
*/
int main(int argc, char **args) {
uint64_t tstart, tend, ticks;
int n, N;
N = atoi(args[1]);
QueryThreadCycleTime(GetCurrentThread(), &tstart);
for (n = 0; n < N; ++n) {
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
}
QueryThreadCycleTime(GetCurrentThread(), &tend);
printf("%G\n", (double)(tend - tstart)/N/10);
return 0;
}
Well as CLOCK_THREAD_CPUTIME_ID
is implemented using rdtsc
it will likely suffer from the same problems as it. The manual page for clock_gettime
says:
The CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks are realized on many platforms using timers from the CPUs (TSC on i386, AR.ITC on Itanium). These registers may differ between CPUs and as a consequence these clocks may return bogus results if a process is migrated to another CPU.
Which sounds like it might explain your problems? Maybe you should lock your process to one CPU to get stable results?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With