Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

x86_64 - Assembly - loop conditions and out of order

I am not asking for a benchmark.

(If that was the case, I would have done it myself.)


My question:

I tend to avoid the indirect/index addressing modes for convenience.

As a replacement, I often use immediate, absolute or register addressing.

The code:

; %esi has the array address. Say we iterate a doubleword (4bytes) array.
; %ecx is the array elements count
(0x98767) myloop:
    ... ;do whatever with %esi
    add $4, %esi
    dec %ecx
    jnz 0x98767;

Here, we have a serialized combo(dec and jnz) which prevent proper out of order execution (dependency).

Is there a way to avoid that / break the dep? (I am not an assembly expert).

like image 412
Kroma Avatar asked Aug 02 '15 11:08

Kroma


People also ask

How do loops work in assembly?

A loop is a block of statements that are repeatedly executed until a condition is satisfied. The assembly language uses JMP instruction to implement loops. However, the processor set can use the LOOP instruction to implement loops conveniently.

What is the purpose of the x86 assembly language loop instruction?

❖ Loop Instruction The Loop instruction provides a simple way to repeat a block of statements a specific number of times. ECX is automatically used as a counter and is decremented each time the loop repeats.


Video Answer


1 Answers

When optimizing for Intel CPUs, always put the flag-setting instruction right before the conditional jump instruction (if it's one of the simple ones listed in the table below), so they can macro-fuse into one uop in the decoders.

Doing this is not significantly worse for older CPUs that don't do macro-fusion. Putting the flag-setting earlier might shorten the branch mispredict penalty by one for such CPUs, but out-of-order execution means that moving the dec a couple instruction earlier won't make a real difference. See also Avoid stalling pipeline by calculating conditional early. To really make a difference, you do stuff like unroll the loop and/or branch on something that can be calculated more simply, ideally without a dependency on a slow input, so OoO exec can have the branch already resolved while working on older iterations of the loop body. i.e. the loop counter dep-chain can run ahead of the main work.

I don't have benchmarks, but I don't think the small downside on increasingly-rare CPUs justifies missing out on the front-end throughput benefit (decode and issue) for CPUs that do fusion. Total uop throughput can often be a bottleneck.

AMD Bulldozer/Piledriver/Steamroller can fuse test/cmp with any jcc, but only test/cmp, not any other ALU instructions. So definitely put compares with branches. It's still valuable for Intel CPUs to put other things with branches if they can macro-fuse on sandybridge-family.

From Agner Fog's microarch guide, Table 9.2 (for Sandybridge / Ivybridge):

First       | can pair with these  |  cannot pair with
instruction | (and the inverse)    |
---------------------------------------------
cmp         |jz, jc, jb, ja, jl, jg|   js, jp, jo
add, sub    |jz, jc, jb, ja, jl, jg|   js, jp, jo
adc, sbb    |none                  |
inc, dec    |jz, jl, jg            |   jc, jb, ja, js, jp, jo
test        | all                  |
and         | all                  |
or, xor, not, neg | none           |
shift, rotate     | none           |

Table 9.2. Instruction fusion

So basically, inc/dec can macro-fuse with a jcc as long as the condition only depends on bits that are modified by inc/dec.

(Otherwise, they don't macro-fuse, and you get an extra uop inserted to merge the flags (like when you read eax after writing al). Or on earlier CPUs, a partial-flags stall.)

Core2 / Nehalem was more limited in macro-fusion capability (just for CMP/TEST with more limited JCC combinations), and Core2 couldn't macro-fuse in 64bit mode at all.

Read Agner Fog's optimizing asm and C guides, too, if you haven't already. They're full of essential knowledge.

like image 129
Peter Cordes Avatar answered Nov 05 '22 01:11

Peter Cordes