I wrote this snippet in a recent argument over the supposed speed of array[i++] vs array[i]; i++.
int array[10];
int main(){
    int i = 0;
    while(i < 10){
        array[i] = 0;
        i++;
    }
    return 0;
}
Snippet at the compiler explorer: https://godbolt.org/g/de7TY2
As expected, the compiler output identical asm for array[i++] and array[i]; i++ with at least -O1. However, what surprised me was the placement of the xor eax, eax seemingly randomly in the function at higher optimization levels.
At -O2, GCC places the xor before the ret, as expected:
mov DWORD PTR [rax], 0
add rax, 4
cmp rax, OFFSET FLAT:array+40
jne .L2
xor eax, eax
ret
However, it places the xor after the second mov at -O3:
mov QWORD PTR array[rip], 0
mov QWORD PTR array[rip+8], 0
xor eax, eax
mov QWORD PTR array[rip+16], 0
mov QWORD PTR array[rip+24], 0
mov QWORD PTR array[rip+32], 0
ret
icc places it normally at -O1:
push rsi
xor esi, esi
push 3
pop rdi
call __intel_new_feature_proc_init
stmxcsr DWORD PTR [rsp]
xor eax, eax
or DWORD PTR [rsp], 32832
ldmxcsr DWORD PTR [rsp]
..B1.2:
mov DWORD PTR [array+rax*4], 0
inc rax
cmp rax, 10
jl ..B1.2
xor eax, eax
pop rcx
ret
but in a strange place at -O2:
push rbp
mov rbp, rsp
and rsp, -128
sub rsp, 128
xor esi, esi
mov edi, 3
call __intel_new_feature_proc_init
stmxcsr DWORD PTR [rsp]
pxor xmm0, xmm0
xor eax, eax
or DWORD PTR [rsp], 32832
ldmxcsr DWORD PTR [rsp]
movdqu XMMWORD PTR array[rip], xmm0
movdqu XMMWORD PTR 16+array[rip], xmm0
mov DWORD PTR 32+array[rip], eax
mov DWORD PTR 36+array[rip], eax
mov rsp, rbp
pop rbp
ret
and -O3:
and rsp, -128
sub rsp, 128
mov edi, 3
call __intel_new_proc_init
stmxcsr DWORD PTR [rsp]
xor eax, eax
or DWORD PTR [rsp], 32832
ldmxcsr DWORD PTR [rsp]
mov rsp, rbp
pop rbp
ret
Only clang places the xor directly in front of the ret at all optimization levels:
xorps xmm0, xmm0
movaps xmmword ptr [rip + array+16], xmm0
movaps xmmword ptr [rip + array], xmm0
mov qword ptr [rip + array+32], 0
xor eax, eax
ret
Since GCC and ICC both do this at higher optimisation levels, I presume there must be some kind of good reason.
Why do some compilers do this?
The code is semantically identical, of course, and the compiler can reorder it as it wishes, but since the placement only changes at higher optimization levels, it must be caused by some kind of optimization.
Since eax isn't used, compilers can zero the register whenever they want, and it works as expected.
An interesting thing that you didn't notice is the icc -O2 version:
xor eax, eax
or DWORD PTR [rsp], 32832
ldmxcsr DWORD PTR [rsp]
movdqu XMMWORD PTR array[rip], xmm0
movdqu XMMWORD PTR 16+array[rip], xmm0
mov DWORD PTR 32+array[rip], eax ; set to 0 using the value of eax
mov DWORD PTR 36+array[rip], eax
Notice that eax is zeroed for the return value, but is also used to zero two memory regions (the last two instructions), probably because the instruction using eax is shorter than the instruction with the immediate zero operand. So two birds with one stone.
Different instructions have different latencies, and changing the order of instructions can sometimes speed up the code. For example, if an instruction that takes several cycles to complete sits at the end of the function, the program simply waits until it is done; if it sits earlier in the function, other work can proceed while it finishes. On second thought, that is unlikely to be the actual reason here, as I believe xor of registers is a low-latency instruction. Latencies are processor-dependent, though.
However, placing the XOR there may have to do with separating the mov instructions between which it is placed.
There are also optimizations that take advantage of the capabilities of modern processors, such as pipelining, branch prediction (not the case here, as far as I can see), etc. You need a pretty deep understanding of these capabilities to understand what an optimizer may do to take advantage of them.
You might find this informative. It pointed me to Agner Fog's site, a resource I had not seen before, which has a lot of the information you wanted (or didn't want :-) ) to know but were afraid to ask :-)
Those memory accesses are expected to burn at least several clock cycles. You can move the xor without changing the functionality of the code. By pulling it back so that one or more memory accesses come after it, it becomes free: it doesn't cost you any execution time, because it runs in parallel with the external access (the processor finishes the xor and waits on the external activity, rather than just waiting on the external activity). If you put it in a clump of instructions without memory accesses, it costs a clock at least. And as you probably know, using xor instead of a mov immediate reduces the size of the instruction, probably not costing clocks but saving space in the binary. A gee-whiz, kinda cool optimization that dates back to the original 8086 and is still used today, even if it doesn't save you much in the end.