
Why does the VC++ compiler MOV+PUSH args instead of just PUSH them? x86

In this disassembly from VC++ a function call is being made. The compiler MOVs the local pointers to a register before pushing them:

    memcpy( nodeNewLocation, pNode, sizeCurrentNode );
0041A5DA 8B 45 F8             mov         eax,dword ptr [ebp-8]  
0041A5DD 50                   push        eax  
0041A5DE 8B 4D 0C             mov         ecx,dword ptr [ebp+0Ch]  
0041A5E1 51                   push        ecx  
0041A5E2 8B 55 D4             mov         edx,dword ptr [ebp-2Ch]  
0041A5E5 52                   push        edx  
0041A5E6 E8 67 92 FF FF       call        00413852  
0041A5EB 83 C4 0C             add         esp,0Ch 

Why not just push them directly? That is:

push  dword ptr [ebp-8]

Also, if the push is going to be a separate instruction anyway, why not store the value manually? In other words, instead of "push eax" above, do

mov [esp], eax

And so on. The advantage is that after the three MOVs you can do a single subtract to set the new stack pointer, instead of implicitly subtracting three times with the pushes.

UPDATE: Release version

This is the same code compiled for release:

; 741  :    memcpy( nodeNewLocation, pNode, sizeCurrentNode );

  00087 8b 45 f8        mov     eax, DWORD PTR _sizeCurrentNode$[ebp]
  0008a 8b 7b 04        mov     edi, DWORD PTR [ebx+4]
  0008d 50              push    eax
  0008e 56              push    esi
  0008f 57              push    edi
  00090 e8 00 00 00 00  call    _memcpy
  00095 83 c4 0c        add     esp, 12            ; 0000000cH

Definitely more efficient than the debug version, but it is still doing a MOV/PUSH combo.
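
For anyone who wants to reproduce the two listings, here is a minimal sketch of a function with the same shape as the call above: one pointer parameter and two locals passed to memcpy. The Node type, the pool parameter, and the function name relocate_node are hypothetical stand-ins, since the original source is not shown. Compiling it with MSVC for 32-bit x86 using /Od /FA and then /O2 /FA should give unoptimized and optimized assembly listings to compare, though the exact output depends on the compiler version, and an optimized build may expand memcpy inline instead of calling it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node layout; the asker's real type is not shown. */
typedef struct Node {
    size_t size;                 /* number of bytes to copy for this node */
    unsigned char payload[32];
} Node;

/* Copies pNode into pool using two locals and one parameter, so the
   call site resembles the memcpy call in the listings above. */
void *relocate_node(unsigned char *pool, const Node *pNode)
{
    void *nodeNewLocation;
    size_t sizeCurrentNode;

    assert(pool != NULL && pNode != NULL);

    nodeNewLocation = pool;
    sizeCurrentNode = pNode->size;
    assert(sizeCurrentNode <= sizeof *pNode);

    memcpy(nodeNewLocation, pNode, sizeCurrentNode);
    return nodeNewLocation;
}
```

Calling relocate_node with a buffer of at least sizeof(Node) bytes and a node whose size field is sizeof(Node) copies the whole node into the buffer.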

Tyler Durden asked Oct 29 '12 15:10


3 Answers

This is an optimization. It is explicitly mentioned in the Intel processor manuals, volume 4, section 12.3.3.6:

In Intel Atom microarchitecture, using PUSH/POP instructions to manage stack space and address adjustment between function calls/returns will be more optimal than using ENTER/LEAVE alternatives. This is because PUSH/POP will not need MSROM flows and stack pointer address update is done at AGU. When a callee function need to return to the caller, the callee could issue POP instruction to restore data and restore the stack pointer from the EBP.

Assembly/Compiler Coding Rule 19. (MH impact, M generality) For Intel Atom processors, favor register form of PUSH/POP and avoid using LEAVE; Use LEA to adjust ESP instead of ADD/SUB.

The rest of the manual isn't that clear about the reason, but it does mention a possible 3 cycle AGU stall on implicit ESP adjustments.

Hans Passant answered Nov 19 '22 02:11


I suspect it only does it in debug builds, or possibly in some situations where it's warranted by pipelining or other considerations (e.g. it could put a parameter into esi and use it after the call). I've looked into some binaries, and MSVC definitely does use such pushes:

 push ebx          ; mthd
 push dword ptr [ebp+place+4]
 push dword ptr [ebp+place] ; pos
 push [ebp+filedes]   ; fh
 call __lseeki64_nolock

(code from the CRT)

As for the second question, instructions addressing esp are longer than pushes: "push eax" is one byte while "mov [esp-8], eax" is four bytes. In fact, this approach (mov instead of push) is used by GCC by default since a couple versions ago (option -maccumulate-outgoing-args) and it has led to notable increases in code size. Supposedly it makes code faster but I'm unconvinced.
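
To make the size comparison concrete, here is a small sketch that stores the 32-bit encodings as byte arrays and adds up their lengths. The bytes were assembled by hand, so treat them as an assumption. The encoding of "mov eax, [ebp-8]" does match the opcode bytes in the question's debug listing. The array and function names are made up for illustration.

```c
#include <stddef.h>

/* Hand-assembled 32-bit x86 encodings (an assumption, not tool output). */
static const unsigned char PUSH_EAX[]       = { 0x50 };                   /* push eax               */
static const unsigned char PUSH_MEM[]       = { 0xFF, 0x75, 0xF8 };       /* push dword ptr [ebp-8] */
static const unsigned char MOV_EAX_MEM[]    = { 0x8B, 0x45, 0xF8 };       /* mov eax, [ebp-8]       */
static const unsigned char MOV_STACK_EAX[]  = { 0x89, 0x44, 0x24, 0xF8 }; /* mov [esp-8], eax       */

/* Bytes per argument for the mov + push pair seen in the question. */
size_t bytes_mov_then_push(void)
{
    return sizeof MOV_EAX_MEM + sizeof PUSH_EAX;
}

/* Bytes per argument when pushing the memory operand directly. */
size_t bytes_push_mem(void)
{
    return sizeof PUSH_MEM;
}

/* Bytes per argument when storing to the stack with mov instead of push
   (the single esp adjustment for the whole call is not counted here). */
size_t bytes_mov_then_store(void)
{
    return sizeof MOV_EAX_MEM + sizeof MOV_STACK_EAX;
}
```

Under these encodings, each argument costs 4 bytes with mov + push, 3 bytes with a direct memory push, and 7 bytes with mov + mov, which is consistent with the code-size growth described above.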

Igor Skochinsky answered Nov 19 '22 04:11


I actually figured out the reason for it. It has to do with the way instructions are pipelined on the Pentium MMX. There are two pipelines, U and V, which allow these processors to execute two instructions at a time if they are pairable. PUSHes are not pairable with one another, but they are pairable with MOVs. So, if you write:

mov eax, [indirect]
mov esi, [indirect]
push eax
push esi

then instructions #1 and #3 get paired and #2 and #4 get paired, so, effectively, these four instructions run in the same number of cycles as a single mov/push, and a single mov/push is faster than two push [indirect]s. This exact case is described in detail in Section 4.3, p. 41, Examples 4.11a and 4.11b, of the microarchitecture optimization guide by Agner Fog, which is widely available on the internet.

Tyler Durden answered Nov 19 '22 04:11