Is there a quick/easy way to do this (for at least a rough estimate) ?
I'm benchmarking algorithms and I thought it would be cool to know the absolute speed at which my computer is executing instructions and compare that with my asymptotic analysis.
The thousand instructions per second (kIPS) unit is rarely used today, as most current microprocessors can execute at least a million instructions per second.
If you want to know what your CPU can do, then look at the documentation. Your CPU vendor specifies the latency and throughput of all instructions, as well as a variety of other information (how many instructions can be issued or retired per cycle, cache latencies and much more). Based on this, you can compute the theoretical peak throughput.
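As a rough illustration, here is a minimal sketch of that calculation; every figure in it is an assumption about a hypothetical 4-core chip, so substitute the numbers from your own vendor's documentation:

    #include <stdio.h>

    /* Back-of-the-envelope peak-FLOPS estimate. All of the figures below are
       assumptions for a hypothetical CPU -- take the real ones from your
       vendor's documentation. */
    int main(void) {
        double cores         = 4;      /* physical cores                     */
        double clock_hz      = 3.0e9;  /* sustained clock frequency          */
        double fma_per_cycle = 2;      /* FMA instructions issued per cycle  */
        double simd_lanes    = 4;      /* 256-bit vectors of 64-bit doubles  */
        double flops_per_fma = 2;      /* one fused multiply-add = 2 flops   */

        double peak = cores * clock_hz * fma_per_cycle * simd_lanes * flops_per_fma;
        printf("theoretical peak: %.0f GFLOPS\n", peak / 1e9);  /* 192 GFLOPS */
        return 0;
    }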
If you want to know what your CPU is actually doing, then run your own code and measure its performance.
However, keep in mind that modern CPUs are really complex beasts whose performance depends on a wide variety of factors. You will very rarely be able to come anywhere near maxing out your CPU, and understanding why, or what exactly is holding your code back, requires a fairly thorough understanding of the hardware. (My usual rule of thumb is that you're doing very well if you get a sustained 30-40% of the theoretical peak FLOPS.)
This is a typical case of "In theory, theory and practice are the same; in practice, they are not".
Modern CPUs have very sophisticated logic in them, which means that the ACTUAL number of operations performed is different from what you'd think just from looking at the code or thinking about the problem [unless you have a brain the size of a small planet and know how that particular CPU works]. For example, a processor may speculatively execute instructions on one side or the other of a branch, even if it hasn't quite got to the branch - if that's the "wrong" side, it will discard the results of those instructions - but of course it took time to execute them.
Instructions are also executed out of order, which means that it's hard to predict exactly which instruction will execute when. There are some exceptions.
You will only get (anywhere near) the theoretical throughput if you are pushing data and instructions through all the available execution units at once - this means having the right mix of instructions, and of course ALL of the code and data in caches.
So, in theory we could stuff the processor full of instructions that max it out by writing very clever code. In practice, that very quickly turns into a hard task.
However, the question is about measuring the throughput of instructions, and on modern CPUs this is very much possible with the right extra software: on Linux, perf or OProfile; for Windows, there are Intel's VTune and AMD's CodeAnalyst. These will allow you (subject to sufficient privileges) to read the "performance counters" in the processor, which include counters for "number of instructions", "number of float operations", "number of cache misses", "branches mispredicted" and many, many other measurements of processor performance. So, given a sufficient length of runtime (at least a few seconds, preferably more), you can measure the actual instruction count and clock cycles that a processor uses.
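If you would rather read such a counter programmatically than through one of those tools, Linux exposes the same hardware counters through the perf_event_open system call. Below is a minimal sketch (Linux-only; the loop is just placeholder work standing in for whatever algorithm you are benchmarking) that counts retired instructions in user space:

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Thin wrapper: glibc provides no perf_event_open() stub, so call it via syscall(). */
    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags) {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;  /* instructions retired */
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;

        int fd = perf_event_open(&attr, 0, -1, -1, 0);  /* this process, any CPU */
        if (fd == -1) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* Placeholder workload -- replace with the code you want to measure. */
        volatile double sum = 0.0;
        for (long i = 0; i < 10000000; ++i) sum += i * 0.5;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count = 0;
        read(fd, &count, sizeof(count));
        printf("instructions retired: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }

Dividing that count by the elapsed wall-clock time gives you an actual instructions-per-second figure for your own code.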
In practice these days, effective instruction throughput depends primarily on memory latency, which is the main bottleneck on performance. Waiting for data is bad. Processors can alleviate this problem somewhat with techniques such as caching, pipelining and concurrency, but the issue remains and will only get worse over time.
Proper implementation can make a huge difference. You may want to check out this question about cache-friendly code.
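As a small illustration of how much the access pattern alone can matter, here is a sketch in C (the array size and timing method are arbitrary choices): both loops do identical arithmetic over the same array, but the first walks memory sequentially while the second strides across it and misses the cache far more often.

    #include <stdio.h>
    #include <time.h>

    #define N 4096
    static double a[N][N];   /* ~128 MB, far larger than any cache level */

    int main(void) {
        double sum = 0.0;

        clock_t t0 = clock();
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                sum += a[i][j];        /* row-major: sequential, cache-friendly */
        clock_t t1 = clock();

        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                sum += a[i][j];        /* column-major: large strides, many misses */
        clock_t t2 = clock();

        printf("row-major:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("column-major: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        printf("(sum = %f)\n", sum);   /* keeps the loops from being optimized away */
        return 0;
    }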
Modern CPUs pipeline instruction processing, so there is no single constant as such.
You could, however, read out the number of CPU ticks at the start of your algorithm and at the end. I think this is about as low-level as you can get with such a measurement.
http://en.wikipedia.org/wiki/Time_Stamp_Counter
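A minimal sketch of that idea in C, using the __rdtsc() intrinsic available in GCC and Clang on x86 (the loop is just placeholder work standing in for your algorithm):

    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtsc(), x86 only */

    int main(void) {
        uint64_t start = __rdtsc();

        /* Placeholder workload -- replace with the code you want to time. */
        volatile long sum = 0;
        for (long i = 0; i < 1000000; ++i) sum += i;

        uint64_t end = __rdtsc();
        printf("elapsed ticks: %llu\n", (unsigned long long)(end - start));
        return 0;
    }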
Note: there are a lot of reasons why this won't be 100% accurate. I can mention a few, but I'm sure the community will be able to add to the list:
- the OS pre-empting your process
- cache misses (the algorithm will run slower the first time, faster if it's run subsequently)
- on older CPUs, the CPU tick rate is not invariant to changes in CPU frequency