In this VC++ disassembly of a function call, the compiler MOVs each argument into a register before pushing it:
memcpy( nodeNewLocation, pNode, sizeCurrentNode );
0041A5DA 8B 45 F8 mov eax,dword ptr [ebp-8]
0041A5DD 50 push eax
0041A5DE 8B 4D 0C mov ecx,dword ptr [ebp+0Ch]
0041A5E1 51 push ecx
0041A5E2 8B 55 D4 mov edx,dword ptr [ebp-2Ch]
0041A5E5 52 push edx
0041A5E6 E8 67 92 FF FF call 00413852
0041A5EB 83 C4 0C add esp,0Ch
Why not just push them directly? I.e.:
push dword ptr [ebp-8]
Also, if you are going to do a separate push, why not do it manually? In other words, instead of doing "push eax" above, do
mov [esp], eax
and so on. The advantage is that after the three movs you can do a single subtract to set the new stack pointer, instead of implicitly subtracting three times with the pushes.
UPDATE---Release version
This is the same code compiled for release:
; 741 : memcpy( nodeNewLocation, pNode, sizeCurrentNode );
00087 8b 45 f8 mov eax, DWORD PTR _sizeCurrentNode$[ebp]
0008a 8b 7b 04 mov edi, DWORD PTR [ebx+4]
0008d 50 push eax
0008e 56 push esi
0008f 57 push edi
00090 e8 00 00 00 00 call _memcpy
00095 83 c4 0c add esp, 12 ; 0000000cH
Definitely more efficient than the debug version, but it is still doing a MOV/PUSH combo.
This is an optimization. It is explicitly mentioned in the Intel processor manuals, volume 4, section 12.3.3.6:
In Intel Atom microarchitecture, using PUSH/POP instructions to manage stack space and address adjustment between function calls/returns will be more optimal than using ENTER/LEAVE alternatives. This is because PUSH/POP will not need MSROM flows and stack pointer address update is done at AGU. When a callee function need to return to the caller, the callee could issue POP instruction to restore data and restore the stack pointer from the EBP.
Assembly/Compiler Coding Rule 19. (MH impact, M generality) For Intel Atom processors, favor register form of PUSH/POP and avoid using LEAVE; Use LEA to adjust ESP instead of ADD/SUB.
The rest of the manual isn't that clear about the reason, but it does mention a possible 3-cycle AGU stall on implicit ESP adjustments.
I suspect it only does it in debug builds, or possibly in some situations where it's warranted by pipelining or other considerations (e.g. it could put a parameter into esi and use it after the call). I've looked into some binaries, and MSVC definitely does push memory operands directly:
push ebx ; mthd
push dword ptr [ebp+place+4]
push dword ptr [ebp+place] ; pos
push [ebp+filedes] ; fh
call __lseeki64_nolock
(code from the CRT)
As for the second question, instructions addressing esp are longer than pushes: "push eax" is one byte while "mov [esp-8], eax" is four bytes. In fact, this approach (mov instead of push) is used by GCC by default since a couple of versions ago (option -maccumulate-outgoing-args) and it has led to notable increases in code size. Supposedly it makes code faster but I'm unconvinced.
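To make the size argument concrete, here is a small sketch in C that stores hand-assembled 32-bit encodings of both approaches and reports their lengths. The choice of eax, ecx and edx and the helper names are assumptions for illustration only. The negative offsets follow the question's proposal (three movs, then one subtract) and are shown only for the byte count, not as a safe calling sequence.

```c
#include <stddef.h>

/* Hand-assembled 32-bit x86 encodings. Registers are illustrative. */

/* Three register pushes, one byte each. */
static const unsigned char push_sequence[] = {
    0x50,                   /* push eax */
    0x51,                   /* push ecx */
    0x52                    /* push edx */
};

/* The same three stores written as movs relative to esp,
   followed by a single stack pointer adjustment. */
static const unsigned char mov_sequence[] = {
    0x89, 0x44, 0x24, 0xFC, /* mov [esp-4],  eax */
    0x89, 0x4C, 0x24, 0xF8, /* mov [esp-8],  ecx */
    0x89, 0x54, 0x24, 0xF4, /* mov [esp-12], edx */
    0x83, 0xEC, 0x0C        /* sub esp, 12       */
};

/* Size in bytes of the push-based sequence. */
size_t push_sequence_size(void) { return sizeof push_sequence; }

/* Size in bytes of the mov-based sequence. */
size_t mov_sequence_size(void) { return sizeof mov_sequence; }
```

With these encodings the push version is 3 bytes and the mov version is 15 bytes, which is the kind of code size growth described above.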
I actually figured out the reason for it. It has to do with the way instructions are pipelined on the Pentium MMX. There are two pipelines, U and V, which let these processors execute two instructions at a time if they are pairable. A PUSH with a memory operand is not pairable, but a MOV from memory and a PUSH of a register are. So, if you write:
mov eax, [indirect]
mov esi, [indirect]
push eax
push esi
then the two movs pair with each other and the two pushes pair with each other, so these four instructions run in the same number of cycles as a single mov/push, which is faster than two push [indirect]s. This exact case is described in detail in Section 4.3, p. 41, Examples 4.11a and 4.11b, of the microarchitecture optimization guide by Agner Fog, which is widely available on the internet.
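The cycle arithmetic can be sketched with a toy model in C. It is a deliberate simplification, not a description of the real hardware. It assumes that a load into a register and a register push each take one cycle and may share a cycle with an adjacent independent instruction, while a push from memory takes two cycles and never pairs. The Insn type and all function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* One instruction in the toy model (all fields are assumptions). */
typedef struct {
    const char *text;   /* mnemonic, for readability only     */
    int cycles;         /* cycles when issued alone           */
    int pairable;       /* 1 if it may share a cycle          */
    const char *writes; /* register written, "" if none       */
    const char *reads;  /* register read, "" if none          */
} Insn;

/* Greedy in-order pairing: two adjacent pairable instructions issue
   together unless the second reads the register the first writes. */
int count_cycles(const Insn *code, size_t n)
{
    int total = 0;
    size_t i = 0;
    while (i < n) {
        int can_pair = i + 1 < n
            && code[i].pairable && code[i + 1].pairable
            && (code[i].writes[0] == '\0'
                || strcmp(code[i].writes, code[i + 1].reads) != 0);
        if (can_pair) {
            /* The pair costs as much as its slower member. */
            int a = code[i].cycles, b = code[i + 1].cycles;
            total += a > b ? a : b;
            i += 2;
        } else {
            total += code[i].cycles;
            i += 1;
        }
    }
    return total;
}

/* Two pushes straight from memory. */
int direct_push_cycles(void)
{
    static const Insn code[] = {
        { "push [mem1]", 2, 0, "", "" },
        { "push [mem2]", 2, 0, "", "" },
    };
    return count_cycles(code, sizeof code / sizeof code[0]);
}

/* The same work split into two loads and two register pushes. */
int split_push_cycles(void)
{
    static const Insn code[] = {
        { "mov eax, [mem1]", 1, 1, "eax", ""    },
        { "mov esi, [mem2]", 1, 1, "esi", ""    },
        { "push eax",        1, 1, "",    "eax" },
        { "push esi",        1, 1, "",    "esi" },
    };
    return count_cycles(code, sizeof code / sizeof code[0]);
}
```

Under these assumptions direct_push_cycles() returns 4 and split_push_cycles() returns 2. The model ignores everything else that affects real timing, such as address generation stalls and cache misses.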