TL;DR: the first loop runs ~18% faster on a Haswell CPU. Why? The loops come from gcc -O0
(un-optimized) output for ptr++
vs ++ptr
, but the question is why the resulting asm performs differently, not anything about how to write better C.
Let's say we have these two loops:
movl $0, -48(%ebp) //Loop counter set to 0
movl $_data, -12(%ebp) //Pointer to the data array
movl %eax, -96(%ebp)
movl %edx, -92(%ebp)
jmp L21
L22:
// ptr++
movl -12(%ebp), %eax //Get the current address
leal 4(%eax), %edx //Calculate the next address
movl %edx, -12(%ebp) //Store the new (next) address
// rest of the loop is the same as the other
movl -48(%ebp), %edx //Get the loop counter to edx
movl %edx, (%eax) //Move the loop counter value to the CURRENT address; note -12(%ebp) already contains the next one
addl $1, -48(%ebp) //Increase the counter
L21:
cmpl $999999, -48(%ebp)
jle L22
and the second one:
movl %eax, -104(%ebp)
movl %edx, -100(%ebp)
movl $_data-4, -12(%ebp) //Address of the data - 1 element (4 bytes)
movl $0, -48(%ebp) //Set the loop counter to 0
jmp L23
L24:
// ++ptr
addl $4, -12(%ebp) //Calculate the CURRENT address by adding one sizeof(int)==4 bytes
movl -12(%ebp), %eax //Store in eax the address
// rest of the loop is the same as the other
movl -48(%ebp), %edx //Store in edx the current loop counter
movl %edx, (%eax) //Move the loop counter value to the current stored address location
addl $1, -48(%ebp) //Increase the loop counter
L23:
cmpl $999999, -48(%ebp)
jle L24
These loops do exactly the same thing, just in slightly different ways; please refer to the comments for the details.
This asm code is generated from the following two C++ loops:
//FIRST LOOP:
for(;index<size;index++){
*(ptr++) = index;
}
//SECOND LOOP:
ptr = data - 1;
for(index = 0;index<size;index++){
*(++ptr) = index;
}
Now, the first loop is about ~18% faster than the second one; no matter in which order the loops are run, the one with ptr++
is faster than the one with ++ptr
.
To run my benchmarks I just collected the running time of those loops for different sizes, executing them both nested in outer loops to repeat the operation many times.
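A minimal sketch of that kind of harness (simplified and reconstructed; the helper names fill_post_inc / fill_pre_inc and the SIZE / REPS constants are illustrative, and the build flags may differ — something like g++ -O0 -m32 should give 32-bit un-optimized asm similar to the listings above):

#include <chrono>
#include <cstdio>

static const int SIZE = 1000000;
static int data[SIZE];

static void fill_post_inc(int size) {   // first loop: *(ptr++) = index;
    int *ptr = data;
    for (int index = 0; index < size; index++) {
        *(ptr++) = index;
    }
}

static void fill_pre_inc(int size) {    // second loop: *(++ptr) = index;
    int *ptr = data - 1;                // matches the OP's code (pointer one-before-the-array)
    for (int index = 0; index < size; index++) {
        *(++ptr) = index;
    }
}

int main() {
    const int REPS = 100;               // outer loop to repeat the operation
    using clk = std::chrono::steady_clock;

    auto t0 = clk::now();
    for (int r = 0; r < REPS; r++) fill_post_inc(SIZE);
    auto t1 = clk::now();
    for (int r = 0; r < REPS; r++) fill_pre_inc(SIZE);
    auto t2 = clk::now();

    std::printf("ptr++: %.1f ms   ++ptr: %.1f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count(),
                std::chrono::duration<double, std::milli>(t2 - t1).count());
}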
Looking at the asm code, the second loop contains fewer instructions: we have 3 movl and 2 addl, whereas in the first loop we have 4 movl, one addl and one leal. So the first loop has one more movl, and a leal instead of an addl.
Is it correct that the LEA
operation for computing the correct address is much faster than the ADD
(+4) method? Is this the reason for the difference in performance?
As far as I know, once a new address is calculated, some clock cycles must elapse before the memory can be referenced, so the second loop, after the addl $4,-12(%ebp), needs to wait a little before proceeding, whereas in the first loop we can refer to memory immediately while leal computes the next address in the meantime (some kind of better pipelining here).
Is there some reordering going on here? I'm not sure about my explanation for the performance difference of these loops; can I have your opinion?
First of all, performance analysis on -O0
compiler output is usually not very interesting or useful.
Is that correct that the LEAL operation for computing the correct address is much faster than the ADDL (+4) method? Is this the reason for the difference in performance?
Nope, add
can run on every ALU execution port on any x86 CPU. lea
is usually as low latency with simple addressing modes, but not as good throughput. On Atom, it runs in a different stage of the pipeline from normal ALU instructions, because it actually lives up to its name and uses the AGU on that in-order microarchitecture.
See the x86 tag wiki to learn what makes code slow or fast on different microarchitectures, esp. Agner Fog's microarchitecture pdf and instruction tables.
add
is only worse because it lets gcc -O0
make even worse code by using it with a memory destination and then loading from that.
Compiling with -O0
doesn't even try to use the best instructions for the job. e.g. you'll get mov $0, %eax
instead of the xor %eax,%eax
you always get in optimized code. You shouldn't infer anything about what's good from looking at un-optimized compiler output.
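For example, with a trivial function (a small illustration; exact output depends on the gcc version and target), compiling with -O0 -S and -O2 -S shows the difference:

// Hypothetical example, just to show the codegen difference described above.
int return_zero() {
    return 0;  // gcc -O0 typically emits:  movl $0, %eax
               // gcc -O2 typically emits:  xorl %eax, %eax   (the standard zeroing idiom)
}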
-O0
code is always full of bottlenecks, usually on load/store or store-forwarding. Unfortunately IACA doesn't account for store-forwarding latency, so it doesn't realize that these loops actually bottleneck on that latency.
As far as i know, once a new address is calculated before the memory can be referenced some clock cycles must elapse, so the second loop after the addl $4,-12(%ebp) need to wait a little before proceeding,
Yes, the mov
load of -12(%ebp)
won't be ready for about 6 cycles after the load that was part of add
's read-modify-write.
whereas in the first loop we can immediately refer the memory
Yes
and in the meanwhile LEAL will compute the next address
No.
Your analysis is close, but you missed the fact that the next iteration still has to load the value we stored into -12(%ebp)
. So the loop-carried dependency chain is the same length, and next iteration's lea
can't actually start any sooner than in the loop using add.
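To see that store-forward round trip directly, here is a rough sketch (an added example under simple assumptions): incrementing a volatile counter forces the same load -> add -> store loop-carried chain through memory that gcc -O0 creates for the counter at -48(%ebp). The exact cycle count varies by microarchitecture; expect roughly 6 cycles per iteration on SnB-family.

#include <chrono>
#include <cstdio>

int main() {
    volatile int counter = 0;            // volatile forces a load and a store every iteration
    const long iters = 100000000;

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; i++) {
        counter = counter + 1;           // load -> add -> store, loop-carried through memory
    }
    auto t1 = std::chrono::steady_clock::now();

    double ns_per_iter =
        std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
    // Divide by your clock period to get cycles/iteration; expect roughly the
    // store-forwarding latency plus one cycle for the add.
    std::printf("%.2f ns per iteration\n", ns_per_iter);
}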
uop / execution port throughput needs to be considered. In this case, the OP's testing shows it's actually relevant. (Or latency from resource conflicts.)
When gcc -O0
implements ptr++
, it keeps the old value in a register, like you said. So store addresses are known further ahead of time, and there's one fewer load uop that needs an AGU.
Assuming an Intel SnB-family CPU:
## ptr++: 1st loop
movl -12(%ebp), %eax //1 uop (load)
leal 4(%eax), %edx //1 uop (ALU only)
movl %edx, -12(%ebp) //1 store-address, 1 store-data
// no load from -12(%ebp) into %eax
... rest the same.
## ++ptr: 2nd loop
addl $4, -12(%ebp) // read-modify-write: 2 fused-domain uops. 4 unfused: 1 load + 1 store-address + 1 store-data
movl -12(%ebp), %eax // load: 1 uop. ~6 cycle latency for %eax to be ready
... rest the same
So the pointer-increment part of the 2nd loop has one more load uop. Probably the code bottlenecks on AGU throughput (address-generation units). IACA says that's the case for arch=SNB, but that HSW bottlenecks on store-data throughput (not AGUs).
However, without taking store-forwarding latency into account, IACA says the first loop can run at one iteration per 3.5 cycles, vs. one per 4 cycles for the second loop. That's faster than the 6 cycle loop-carried dependency of the addl $1, -48(%ebp)
loop counter, which indicates that the loop is bottlenecked by latency to less than max AGU throughput. (Resource conflicts probably mean it actually runs slower than one iteration per 6c, see below).
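(Aside: IACA analyzes a region delimited by markers from its iacaMarks.h header. A rough sketch of marking up the first loop is below; the exact invocation depends on the IACA version, something like iaca -arch SNB or -arch HSW run on the compiled object file.)

#include "iacaMarks.h"   // ships with IACA; defines IACA_START / IACA_END

void fill_post_inc(int *ptr, int size) {
    for (int index = 0; index < size; index++) {
        IACA_START       // begin the analyzed region (top of the loop body)
        *(ptr++) = index;
    }
    IACA_END             // end of the analyzed region
}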
Adding an extra load uop to the lea
version, off the critical path, would take more throughput, but wouldn't be part of the loop's latency chains. e.g.
movl -12(%ebp), %eax //Get the current address
leal 4(%eax), %edx //Calculate the next address
movl %edx, -12(%ebp) //Store the new (next) address
mov -12(%ebp), %edx
%edx
is about to be overwritten by a mov
, so there are no dependencies on the result of this load. (The destination of mov
is write-only, so it breaks dependency chains, thanks to register renaming.)
So this extra load would bring the lea
loop up to the same number and flavour of uops as the add
loop, but with different latency. If the extra load has no effect on speed, we know the first loop isn't bottlenecked on load / store throughput.
Update: OP's testing confirmed that an extra unused load slows the lea
loop down to about the same speed as the add
loop.
uops are scheduled in oldest-first order (out of uops that have their operands ready), not in critical-path-first order. Extra uops that could have been done in a spare cycle later on will actually delay uops that are on the critical path (e.g. part of the loop-carried dependency). This is called a resource conflict, and can increase the latency of the critical path.
i.e. instead of waiting for a cycle where critical-path latency left a load port with nothing to do, the unused load will run when it's the oldest load with its load-address ready. This will delay other loads.
Similarly, in the add
loop where the extra load is part of the critical path, the extra load causes more resource conflicts, delaying operations on the critical path.
Other guesses:
So maybe having the store address ready sooner is what's doing it, letting memory operations be pipelined better. (e.g. TLB-miss page walks can start sooner when approaching a page boundary. Even normal hardware prefetching doesn't cross page boundaries, even if they are hot in the TLB. The loop touches 4MiB of memory, which is enough for this kind of thing to matter. L3 latency is high enough to maybe create a pipeline bubble. Or if your L3 is small, then main memory certainly is.)
Or maybe the extra latency just makes it harder for out-of-order execution to do a good job.