Performance of modern processor

Question

When executed on a modern processor (AMD Phenom II 1090T), how many clock ticks is the following code more likely to consume per iteration: 3 or 11?

label:  mov (%rsi), %rax
        adc %rax, (%rdx)
        lea 8(%rdx), %rdx
        lea 8(%rsi), %rsi
        dec %ecx
        jnz label

The problem is that when I execute many iterations of this code, the result is sometimes close to 3 and sometimes close to 11 ticks per iteration, and I can't decide which figure is the real one.

UPD According to the table of instruction latencies (PDF), my piece of code takes at least 10 clock cycles on the AMD K10 microarchitecture. Therefore, the seemingly impossible 3 ticks per iteration must be caused by bugs in the measurement.

SOLVED @Atom pointed out that the clock frequency isn't constant in modern processors. When I disabled three options in the BIOS (Core Performance Boost, AMD C1E Support and AMD K8 Cool&Quiet Control), the cost of my six instructions stabilized at 3 clock ticks per iteration :-)
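
To illustrate how a changing clock frequency can distort this kind of measurement, here is a minimal Python sketch. It assumes the tick counter advances at a fixed reference rate while the core clock varies, which may or may not hold on a given machine. The function name and both frequencies are hypothetical values chosen for illustration, not measurements from the 1090T.

```python
def ticks_per_iteration(core_cycles, core_hz, counter_hz):
    """Ticks a fixed-rate counter reports for work that takes
    `core_cycles` cycles of a core running at `core_hz`."""
    return core_cycles * counter_hz / core_hz


# Hypothetical frequencies, for illustration only (assumptions, not data).
COUNTER_HZ = 3_200_000_000   # assumed fixed rate of the tick counter
FULL_SPEED = 3_200_000_000   # core running at the counter's rate
POWER_SAVE = 800_000_000     # core clocked down by a factor of 4

print(ticks_per_iteration(3, FULL_SPEED, COUNTER_HZ))
print(ticks_per_iteration(3, POWER_SAVE, COUNTER_HZ))
```

Under these assumptions the same 3-cycle loop body is reported with very different tick counts depending on the core clock, which is why pinning the frequency makes the numbers stable.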

Mysticial · Accepted Answer

I won't try to answer with certainty how many cycles (3 or 10) it will take to run each iteration, but I'll explain how it might be possible to get 3 cycles per iteration.

(Note that this is for processors in general and I make no references specific to AMD processors.)

Key Concepts:

Out of Order Execution
Register Renaming

Most modern (non-embedded) processors are both superscalar and out-of-order. Not only can they execute multiple independent instructions in parallel, but they can also re-order instructions to work around dependencies.

Let's break down your example:

label:
    mov (%rsi), %rax
    adc %rax, (%rdx)
    lea 8(%rdx), %rdx
    lea 8(%rsi), %rsi
    dec %ecx
    jnz label

The first thing to notice is that the last 3 instructions before the branch are all independent:

    lea 8(%rdx), %rdx
    lea 8(%rsi), %rsi
    dec %ecx

So it's possible for a processor to execute all 3 of these in parallel.

Another thing is this:

    adc %rax, (%rdx)
    lea 8(%rdx), %rdx

There seems to be a dependency on rdx that prevents the two from running in parallel. In reality, this is a false (write-after-read) dependency: adc only reads rdx to form its address, and lea does not use anything that adc produces. The only constraint is that lea must not overwrite rdx before adc has read it. Modern processors can rename the rdx register, which allows these two instructions to be re-ordered or executed in parallel.

The same applies to the rsi register between:

    mov (%rsi), %rax
    lea 8(%rsi), %rsi
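
The distinction between true and false dependencies can be sketched in a few lines of Python. This is only a toy classifier, not a model of any real processor: the read/write sets are my own simplified annotation of the loop body (flags and memory are left out, and ecx is tracked as rcx), and the names `LOOP`, `classify` and `hazards` are hypothetical.

```python
# Each entry: (text, registers read, registers written).
# Simplified annotation: flags and memory are deliberately ignored.
LOOP = [
    ("mov (%rsi), %rax",  {"rsi"},        {"rax"}),
    ("adc %rax, (%rdx)",  {"rax", "rdx"}, set()),
    ("lea 8(%rdx), %rdx", {"rdx"},        {"rdx"}),
    ("lea 8(%rsi), %rsi", {"rsi"},        {"rsi"}),
    ("dec %ecx",          {"rcx"},        {"rcx"}),
]


def classify(earlier, later):
    """Return the hazard kinds between two instructions in program order."""
    _, e_reads, e_writes = earlier
    _, l_reads, l_writes = later
    kinds = set()
    if e_writes & l_reads:
        kinds.add("RAW")  # true dependency: later needs earlier's result
    if e_reads & l_writes:
        kinds.add("WAR")  # false dependency: removable by renaming
    if e_writes & l_writes:
        kinds.add("WAW")  # false dependency: removable by renaming
    return kinds


def hazards(program):
    """Map each (earlier, later) pair that shares a register to its hazards."""
    found = {}
    for i, earlier in enumerate(program):
        for later in program[i + 1:]:
            kinds = classify(earlier, later)
            if kinds:
                found[(earlier[0], later[0])] = kinds
    return found


for pair, kinds in hazards(LOOP).items():
    print(pair, sorted(kinds))
```

In this annotation the only true dependency is mov feeding adc through rax. The two pairs discussed above show up as write-after-read hazards, which is exactly the kind that register renaming removes.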

So in the end, 3 cycles is (potentially) achievable as follows (this is just one of several possible orderings):

1:   mov (%rsi), %rax        lea 8(%rdx), %rdx        lea 8(%rsi), %rsi
2:   adc %rax, (%rdx)        dec %ecx
3:   jnz label

Of course, I'm over-simplifying here. In reality the latencies are probably longer, and different iterations of the loop overlap.
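
The 3-cycle ordering above can be reproduced with a toy scheduler in Python. This is a sketch under loud assumptions, none of which come from a real processor manual: perfect register renaming (only read-after-write dependencies count), every instruction completing in 1 cycle, at most 3 instructions issued per cycle, oldest ready instruction first, and a single iteration in isolation. The read/write sets (with "cf" and "zf" standing in for the flags) and the name `schedule` are hypothetical.

```python
# (text, resources read, resources written); simplified annotation.
LOOP = [
    ("mov (%rsi), %rax",  {"rsi"},              {"rax"}),
    ("adc %rax, (%rdx)",  {"rax", "rdx", "cf"}, {"cf"}),
    ("lea 8(%rdx), %rdx", {"rdx"},              {"rdx"}),
    ("lea 8(%rsi), %rsi", {"rsi"},              {"rsi"}),
    ("dec %ecx",          {"rcx"},              {"rcx", "zf"}),
    ("jnz label",         {"zf"},               set()),
]


def schedule(program, width=3):
    """Greedy oldest-first issue; returns a list of cycles (lists of texts)."""
    # With renaming, an instruction waits only for the most recent earlier
    # writer of each resource it reads (true dependencies).
    last_writer = {}
    producers = []
    for idx, (_, reads, writes) in enumerate(program):
        producers.append({last_writer[r] for r in reads if r in last_writer})
        for w in writes:
            last_writer[w] = idx

    issued_in = {}  # instruction index -> cycle in which it was issued
    cycles = []
    while len(issued_in) < len(program):
        now = len(cycles) + 1
        issued = []
        for idx in range(len(program)):
            if idx in issued_in or len(issued) == width:
                continue
            # Ready only if every producer was issued in an earlier cycle.
            if all(p in issued_in and issued_in[p] < now
                   for p in producers[idx]):
                issued.append(idx)
        for idx in issued:
            issued_in[idx] = now
        cycles.append([program[i][0] for i in issued])
    return cycles


for number, group in enumerate(schedule(LOOP), start=1):
    print(number, group)
```

Calling `schedule(LOOP, width=1)` shows the other extreme: with one instruction per cycle the same body needs six cycles, so the 3-cycle figure depends entirely on the machine being able to issue several independent instructions at once.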

In any case, this could explain how it's possible to get 3 cycles. As for why you sometimes get 10 cycles, there could be a ton of reasons for that: branch misprediction, some random pipeline bubble...

Crashworks · Answer

Dr. David Levinthal's "Performance Analysis Guide", published by Intel, investigates questions like this in great detail.

Tags: performance, assembly, x86-64, amd-processor

leventov
