I'm looking to do some very basic micro-benchmarking of small code paths, such as tight loops, that I've written in C++. I'm running on Linux and OS X, using GCC. What facilities are there for sub-millisecond accuracy? I'm thinking that a simple test that runs the code path many times (several tens of millions?) will give me enough consistency to get a good reading. If anyone knows of preferable methods, please feel free to suggest them.
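You can use the "rdtsc" processor instruction on x86/x86_64. On multicore systems, check for the "constant_tsc" capability in CPUID (listed in /proc/cpuinfo on Linux): it indicates that the counter ticks at a constant rate regardless of dynamic frequency changes, so in practice all cores can be treated as using the same tick counter (whether the counter keeps running across sleep states is reported by the separate "nonstop_tsc" flag).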
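If you would rather check the flag from inside the benchmark than by eye, something like the following could work. It is only a sketch: the helper names has_cpu_flag and has_constant_tsc are made up for illustration, and it assumes the usual Linux /proc/cpuinfo layout where a "flags" line lists the capabilities separated by whitespace.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: returns true if any "flags" line in cpuinfo-style
// text lists `flag` as a whole whitespace-separated token.
bool has_cpu_flag(std::istream& cpuinfo, const std::string& flag) {
    std::string line;
    while (std::getline(cpuinfo, line)) {
        // Only lines that start with "flags" are of interest.
        if (line.compare(0, 5, "flags") != 0) continue;
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::istringstream tokens(line.substr(colon + 1));
        std::string token;
        while (tokens >> token) {
            if (token == flag) return true;
        }
    }
    return false;
}

// Hypothetical convenience wrapper for Linux; returns false if
// /proc/cpuinfo cannot be opened (for example on OS X).
bool has_constant_tsc() {
    std::ifstream cpuinfo("/proc/cpuinfo");
    return cpuinfo && has_cpu_flag(cpuinfo, "constant_tsc");
}
```

Matching whole tokens matters here, because a plain substring search for "tsc" would also hit flags such as "nonstop_tsc".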
If your processor does not support constant_tsc, be sure to bind your program to a single core (for example with the taskset utility on Linux).
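From the shell this is typically something like "taskset -c 0 ./bench" (the binary name is just a placeholder). The same thing can be done from inside the program. Here is a rough, Linux-only sketch using sched_setaffinity; the function name pin_to_core is my own, and on OS X, which as far as I know has no equivalent of this call, the sketch simply reports failure.

```cpp
#ifdef __linux__
#include <sched.h>  // sched_setaffinity, cpu_set_t, CPU_ZERO, CPU_SET
#endif

// Hypothetical helper: pins the calling thread to `core`.
// Returns true on success. On non-Linux platforms it does nothing
// and returns false.
bool pin_to_core(int core) {
#ifdef __linux__
    if (core < 0 || core >= CPU_SETSIZE) return false;
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    // A pid of 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(mask), &mask) == 0;
#else
    (void)core;
    return false;
#endif
}
```

Call it once at the start of main, before taking any counter readings, and check the return value: the call can fail if the chosen core is not in the set the process is allowed to run on.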
When using rdtsc on out-of-order CPUs (everything except Intel Atom and perhaps some other low-end CPUs), add an "ordering" instruction such as "cpuid" before it. cpuid is a serializing instruction, so it keeps the counter read from being reordered with the instructions around it.
Also, Mac OS X has "Shark", which can measure some hardware events in your code.
On RDTSC and out-of-order CPUs, there is more information in section 18 of the second of Agner Fog's great optimization manuals, "Optimizing subroutines in assembly language: An optimization guide for x86 platforms" (the main site with all five manuals is http://www.agner.org/optimize/):
http://www.scribd.com/doc/1548519/optimizing-assembly
On all processors with out-of-order execution, you have to insert XOR EAX,EAX / CPUID before and after each read of the counter in order to prevent it from executing in parallel with anything else. CPUID is a serializing instruction, which means that it flushes the pipeline and waits for all pending operations to finish before proceeding. This is very useful for testing purposes.
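A rough sketch of that pattern with GCC-style inline assembly might look like this. The names read_ticks and min_ticks are invented for the example. Keeping the minimum over many repetitions is my own choice for filtering out interrupts and cache misses, not something the manual prescribes, and the std::chrono fallback for non-x86 targets is likewise just an assumption to keep the sketch compilable everywhere.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical helper: reads a tick counter. On x86 with GCC or Clang it
// executes XOR EAX,EAX / CPUID, then RDTSC, then XOR EAX,EAX / CPUID again,
// so the read is serialized on both sides. Elsewhere it falls back to
// steady_clock nanoseconds (an assumption of this sketch).
inline std::uint64_t read_ticks() {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    std::uint32_t lo, hi;
    asm volatile(
        "xor %%eax, %%eax\n\t"
        "cpuid\n\t"
        "rdtsc\n\t"
        "mov %%eax, %0\n\t"   // save the low half before CPUID overwrites it
        "mov %%edx, %1\n\t"   // save the high half
        "xor %%eax, %%eax\n\t"
        "cpuid\n\t"
        : "=&r"(lo), "=&r"(hi)
        :
        : "eax", "ebx", "ecx", "edx", "memory");
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
#else
    using namespace std::chrono;
    return static_cast<std::uint64_t>(
        duration_cast<nanoseconds>(steady_clock::now().time_since_epoch())
            .count());
#endif
}

// Hypothetical harness: runs `body` `repeats` times and returns the
// smallest tick delta observed for a single run.
template <typename F>
std::uint64_t min_ticks(F&& body, int repeats) {
    assert(repeats > 0);
    std::uint64_t best = std::numeric_limits<std::uint64_t>::max();
    for (int i = 0; i < repeats; ++i) {
        const std::uint64_t start = read_ticks();
        body();
        const std::uint64_t stop = read_ticks();
        if (stop - start < best) best = stop - start;
    }
    return best;
}
```

Measuring an empty lambda first, e.g. min_ticks([] {}, 1000), gives an estimate of the overhead of the CPUID/RDTSC pair itself, which you can subtract from the reading for the real code path. Make sure the code under test has a visible side effect (for instance a write to a volatile variable), otherwise the compiler may optimize the loop away.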