
Exactly how "fast" are modern CPUs?

When I used to program embedded systems and early 8/16-bit PCs (6502, 68K, 8086) I had a pretty good handle on exactly how long (in nanoseconds or microseconds) each instruction took to execute. Depending on the family, one (or four) cycles equated to one "memory fetch", and without caches to worry about, you could guess timings based on the number of memory accesses involved.

But with modern CPUs, I'm confused. I know they're a lot faster, but I also know that the headline gigahertz speed isn't helpful without knowing how many cycles of that clock are needed for each instruction.

So, can anyone provide some timings for two sample instructions, on (let's say) a 2 GHz Core 2 Duo? Best and worst cases (assuming nothing in cache/everything in cache) would be useful.

Instruction #1: Add one 32-bit register to a second.

Instruction #2: Move a 32-bit value from register to memory.

Edit: The reason I ask this is to try to develop a "rule of thumb" that would allow me to look at simple code and roughly gauge the time taken, to the nearest order of magnitude.

Edit #2: Lots of answers with interesting points, but nobody (yet) has put down a figure measured in time. I appreciate there are "complications" to the question, but c'mon: If we can estimate the number of piano-tuners in NYC, we should be able to estimate code runtimes...

Take the following (dumb) code:

int32_t sum = frigged_value();  // int32_t from <stdint.h>

// start timing
for (int i = 0; i < 10000; i++)
{
    for (int j = 0; j < 10000; j++)
    {
        sum += (i * j);
    }
    sum = sum / 1000;
}

// end timing

How can we estimate how long it will take to run... 1 femtosecond? 1 gigayear?

Roddy, asked Jan 11 '09




3 Answers

Modern processors such as the Core 2 Duo you mention are both superscalar and pipelined. They have multiple execution units per core and are actually working on more than one instruction at a time per core; this is the superscalar part. The pipelined part means that there is a latency from when an instruction is read in and "issued" to when it completes execution, and this time varies depending on the dependencies between that instruction and the others moving through the other execution units at the same time. So, in effect, the timing of any given instruction varies depending on what is around it and what it depends on. This means that a given instruction has a sort of best-case and worst-case execution time based on a number of factors. Because of the multiple execution units you can actually have more than one instruction completing execution per core clock, but sometimes there are several clocks between completions if the pipeline has to stall waiting for memory or for dependencies.

All of the above is just from the view of the CPU core itself. Then you have interactions with the caches and contention for bandwidth with the other cores. The Bus Interface Unit of the CPU deals with getting instructions and data fed into the core and putting results back out of the core through the caches to memory.

Rough order of magnitude rules of thumb to be taken with a grain of salt:

  • Register-to-register operations take 1 core clock to execute. This should generally be conservative, especially as more of these appear in sequence (a superscalar core can complete more than one per clock).
  • Memory-related load and store operations take 1 memory-bus clock to execute. This should be very conservative; with a high cache hit rate it will be more like 2 CPU-bus clocks, the clock rate of the bus between the CPU core and the cache, which is not necessarily the core's clock.
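
Applying those rules of thumb to the loop in the question gives a back-of-envelope figure (an estimate, not a measurement): the inner body is a multiply, an add, and the loop increment/compare/branch, all register-to-register.

    10,000 x 10,000 iterations       = 1e8 inner iterations
    multiply + add + loop overhead   ~ 1 to 5 core clocks per iteration
    1e8 x (1 to 5) clocks            = 1e8 to 5e8 clocks
    at 2 GHz (0.5 ns per clock)      ~ 0.05 to 0.25 seconds

So: nowhere near a femtosecond, and nowhere near a gigayear.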
Tall Jeff, answered Oct 10 '22


It is nearly impossible to provide the accurate timing information you are expecting in a way that will be USEFUL to you.

The following concepts affect instruction timing; some can vary from moment to moment:

  • Micro-op decomposition
  • Operation pipelining
  • Super-scalar execution
  • Out of order execution
  • SMT / SMP execution
  • Floating point mode
  • Branch prediction / pre-fetch
  • Cache latency
  • Memory latency
  • Clock speed throttling
  • etc

Consult a book on modern computer architecture if you need any further explanation on the above concepts.

The best way to measure the speed of your code is (surprise!) to measure the speed of your code, running the same workload and under the same conditions as you expect "in the real world".
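
For instance, here is a minimal sketch of such a measurement around the loop from the question, assuming a POSIX system with clock_gettime; frigged_value() is just a hypothetical stand-in returning a constant:

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    /* Stand-in for the questioner's placeholder. */
    static int32_t frigged_value(void) { return 0; }

    int main(void)
    {
        int32_t sum = frigged_value();
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < 10000; i++) {
            for (int j = 0; j < 10000; j++) {
                sum += (i * j);   /* overflow ignored, as in the question */
            }
            sum = sum / 1000;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        /* Print sum so the compiler cannot optimize the loops away. */
        printf("sum = %d, elapsed = %.4f s\n", (int)sum, secs);
        return 0;
    }

Compile with the same optimization flags you will ship with; the optimizer alone can easily change the result severalfold. (On older Linux systems you may need to link with -lrt.)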

HUAGHAGUAH, answered Oct 10 '22


To cut a very, very long story short, using a description largely based on the Intel Pentium architecture:

  • the processor has a number of "execution units" that can perform different types of 'micro-ops'; instructions may be split into several micro-ops
  • the different execution units essentially run in parallel
  • each micro-op ties up the corresponding execution unit for a certain number of clock cycles so meanwhile no other instruction can use that execution unit: e.g. "floating point add" may tie up the "FP execute" unit for 2 clock cycles
  • execution units are grouped by "port", and each clock cycle, a new micro-op can be sent to each port (assuming the relevant execution unit is free at that moment); some units can also be sent an "extra op" halfway through the cycle; so each clock cycle, a certain number of ops can start executing;
  • the processor can re-order micro-ops where this doesn't break dependencies (or where the result can still be reconstructed) to take advantage of which execution units are free at a given moment
  • so instructions can be executing in parallel, but which parts of which instructions are executing at any one time is quite a complex situation
  • the overall time for a given instruction thus depends on how long it had to "wait" for the necessary execution units to become available, the actual time that those ops spent running on the given units, plus any extra time required to "tie up the result"

Since the timing of an instruction depends on the surrounding instructions, in practice it's usually best to time a representative piece of code rather than worry about individual instructions. However:

  • Intel (and presumably other manufacturers) publish a list of instruction throughput and latency timings
  • the throughput is the number of clock cycles actually needed on the relevant execution unit(s)
  • the latency is a "worst case" number of clock cycles required, once an instruction starts executing, before the result of that execution is available as input to another instruction

So for example, if floating point add and multiply instructions each have a throughput of 2 and a latency of 5 (actually, for multiply it's a bit greater, I think), that means that adding a register to itself or multiplying it by itself will likely take two clock cycles (since there are no other dependent values), whereas adding to it the result of a previous multiplication will take something like, or a bit less than, 2 + 5 clock cycles, depending on where you start and finish timing, and on all sorts of other things. (During some of those clock cycles, another add/multiply operation could be taking place, so it's arguable how many cycles you actually attribute to the individual add/multiply instructions anyway...)
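
To see the latency/throughput distinction in code rather than in the tables, compare these two ways of summing an array; this is an illustrative sketch (the function names and the 4-way unroll factor are my choices, not anything from Intel's documentation):

    #include <stddef.h>

    /* One long dependency chain: every add needs the previous add's
       result, so the loop is bounded by the add *latency*. */
    double sum_chained(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Four independent chains: the adds can overlap in the pipeline,
       so the loop approaches the add *throughput* instead. (Note this
       reassociates the floating point additions.) */
    double sum_unrolled(const double *a, size_t n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        size_t i;
        for (i = 0; i + 3 < n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }

On hardware where the add latency is several times its (reciprocal) throughput, the second version can run several times faster despite executing essentially the same number of instructions.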

Oh, and just as a concrete example: for the following Java code

public void runTest(double[] data, double randomVal) {
  for (int i = data.length-1; i >= 0; i--) {
    data[i] = data[i] + randomVal;
  }
}

Hotspot 1.6.12 JIT-compiles the inner loop sequence to the following Intel code, consisting of a load-add-store for each position in the array (with 'randomVal' being held in XMM0a in this case):

  0b3     MOVSD  XMM1a,[EBP + #16]
  0b8     ADDSD  XMM1a,XMM0a
  0bc     MOVSD  [EBP + #16],XMM1a
  0c1     MOVSD  XMM1a,[EBP + #8]
  0c6     ADDSD  XMM1a,XMM0a
  0ca     MOVSD  [EBP + #8],XMM1a
  ...

Each group of load-add-store appears to take 5 clock cycles.

Neil Coffey, answered Oct 10 '22