x86 Assembly: Why use Push/Pop instead of Mov?

I have some sample code from a shell code payload showing a for loop and using push/pop to set the counter:

push 9
pop ecx

Why can it not just use mov?

mov ecx, 9

Asked by Hawke, Jun 16 '19 12:06



2 Answers

Yes, normally you should always use mov ecx, 9 for performance reasons. It runs more efficiently than push/pop, as a single-uop instruction that can run on any port. (This is true across all existing CPUs that Agner Fog has tested: https://agner.org/optimize/)


The normal reason for push imm8 / pop r32 is that the machine code is free of zero bytes. This is important for shellcode that has to overflow a buffer via strcpy or any other method that treats it as part of an implicit-length C string terminated by a 0 byte.

mov ecx, immediate is only available with a 32-bit immediate, so its machine code is B9 09 00 00 00 (5 bytes, three of them zero), versus 6A 09 for push 9 followed by 59 for pop ecx (3 bytes in total, none of them zero).
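To make the zero-byte problem concrete, here is a minimal Python sketch that models what a strcpy-style copy would keep of each byte sequence. The helper name c_string_copy is hypothetical and only imitates "stop at the first 0 byte"; it is not how a real exploit or libc is written.

```python
# Machine code for both variants in 32-bit mode, as quoted in the answer.
MOV_ECX_9 = bytes([0xB9, 0x09, 0x00, 0x00, 0x00])   # mov ecx, 9
PUSH_POP_ECX_9 = bytes([0x6A, 0x09, 0x59])          # push 9 ; pop ecx


def c_string_copy(payload: bytes) -> bytes:
    """Hypothetical model of strcpy: keep everything before the first 0 byte."""
    end = payload.find(b"\x00")
    return payload if end == -1 else payload[:end]


for name, code in [("mov ecx, 9", MOV_ECX_9),
                   ("push 9 / pop ecx", PUSH_POP_ECX_9)]:
    copied = c_string_copy(code)
    print(f"{name}: {len(code)} bytes, {len(copied)} copied, "
          f"zero-free={0 not in code}")
```

Under this model the mov encoding is cut off after its first two bytes, while the push/pop pair is copied in full.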

(ECX is register number 1, which is where B9 and 59 come from: the low 3 bits of each opcode byte hold the register number, 001.)
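The register-in-opcode scheme can be sketched in a few lines of Python. The function names below are made up for illustration; the base opcodes B8 (mov r32, imm32), 58 (pop r32) and 6A (push imm8) are the standard 32-bit mode encodings.

```python
# Register numbers used in x86 instruction encodings.
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def mov_r32_imm32(reg: str, imm: int) -> bytes:
    """mov r32, imm32: opcode B8 + register number, then 4 little-endian bytes."""
    return bytes([0xB8 + REG32[reg]]) + (imm & 0xFFFFFFFF).to_bytes(4, "little")


def push_imm8(imm: int) -> bytes:
    """push imm8: opcode 6A, then one sign-extended immediate byte."""
    if not -128 <= imm <= 127:
        raise ValueError("immediate does not fit in a signed byte")
    return bytes([0x6A, imm & 0xFF])


def pop_r32(reg: str) -> bytes:
    """pop r32: opcode 58 + register number, no further bytes."""
    return bytes([0x58 + REG32[reg]])


print(mov_r32_imm32("ecx", 9).hex(" "))
print((push_imm8(9) + pop_r32("ecx")).hex(" "))
```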


The other use-case is purely code-size: mov r32, imm32 is 5 bytes (using the no-ModRM encoding that puts the register number in the low 3 bits of the opcode), because x86 unfortunately lacks a sign-extended imm8 opcode for mov (there's no mov r/m32, imm8). Such an imm8 form does exist for nearly all ALU instructions that date back to 8086.

In 16-bit 8086, that encoding wouldn't have saved any space: the 3-byte short-form mov r16, imm16 would be just as good as a hypothetical mov r/m16, imm8 for almost everything, except moving an immediate to memory where the mov r/m16, imm16 form (with a ModRM byte) is needed.

Since 386's 32-bit mode didn't add new opcodes specific to that mode, just changed the default operand-size and immediate widths, this "missed optimization" in the ISA in 32-bit mode started with 386. With full-width immediates being 2 bytes longer, an add r32,imm32 is now longer than an add r/m32, imm8. See x86 assembly 16 bit vs 8 bit immediate operand encoding. But we don't have that option for mov because there's no MOV opcode that sign-extends (or zero-extends) its immediate.
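The size argument can be checked by counting bytes. The sketch below builds the 32-bit mode encodings of add ecx, 9 in both immediate widths and of mov ecx, 9; the variable names are only illustrative, and the opcodes used are 83 /0 ib, 81 /0 id and B8+rd id.

```python
ECX = 1  # register number of ecx


def imm32(value: int) -> bytes:
    """Encode a 32-bit immediate as 4 little-endian bytes."""
    return (value & 0xFFFFFFFF).to_bytes(4, "little")


# ModRM byte for "register direct, /0, rm = reg" is 0xC0 | reg.
add_ecx_imm8 = bytes([0x83, 0xC0 | ECX, 0x09])          # add r/m32, imm8
add_ecx_imm32 = bytes([0x81, 0xC0 | ECX]) + imm32(9)    # add r/m32, imm32
mov_ecx_imm32 = bytes([0xB8 + ECX]) + imm32(9)          # mov r32, imm32

for name, code in [("add ecx, 9 (imm8) ", add_ecx_imm8),
                   ("add ecx, 9 (imm32)", add_ecx_imm32),
                   ("mov ecx, 9 (imm32)", mov_ecx_imm32)]:
    print(f"{name}: {len(code)} bytes: {code.hex(' ')}")
```

add gets a 3-byte form thanks to its sign-extended imm8 opcode, while mov is stuck with the full 4-byte immediate.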

Fun fact: clang -Oz (optimize for size even at the expense of speed) will compile int foo(){return 9;} to push 9 ; pop rax. GCC12 also supports a similar -Oz.

See also Tips for golfing in x86/x64 machine code on Codegolf.SE (a site about optimizing for size usually for fun, rather than to fit code into a small ROM or boot sector. But for machine code, optimizing for size does have practical applications sometimes, even at the expense of performance.)

If you already had another register with known contents, creating 9 in another register can be done with 3-byte lea ecx, [eax-0 + 9] (if EAX holds 0). Just Opcode + ModRM + disp8. So you can avoid the push/pop hack if you already were going to xor-zero any other register. lea is barely less efficient than mov, and you could consider it when optimizing for speed because smaller code-size has minor speed benefits in the large scale: L1i cache hits, and sometimes decode if the uop cache isn't already hot.
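As a rough illustration of the Opcode + ModRM + disp8 layout, this Python sketch assembles the lea by hand, together with a xor eax, eax that would supply the zero in EAX. The modrm helper is a hypothetical convenience function, not part of any library.

```python
def modrm(mod: int, reg: int, rm: int) -> int:
    """Pack the three ModRM fields: mod (2 bits), reg (3 bits), rm (3 bits)."""
    return (mod << 6) | (reg << 3) | rm


EAX, ECX = 0, 1

# xor eax, eax: opcode 31 /r with both operands in registers (mod = 11).
xor_eax_eax = bytes([0x31, modrm(0b11, EAX, EAX)])

# lea ecx, [eax + 9]: opcode 8D /r, mod = 01 selects an 8-bit displacement.
lea_ecx_eax_9 = bytes([0x8D, modrm(0b01, ECX, EAX), 0x09])

print(xor_eax_eax.hex(" "))
print(lea_ecx_eax_9.hex(" "))
print("zero-free:", 0 not in xor_eax_eax + lea_ecx_eax_9)
```

Both instructions are also free of zero bytes, so under the assumption that EAX is already zeroed the lea costs the same 3 bytes as push/pop.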

Answered by Peter Cordes, Oct 13 '22 02:10


There can be several reasons for this.

In this case it seems to be done because the code is smaller:

The push/pop combination is 3 bytes long, whereas the mov instruction is 5 bytes long.

However, I would guess that the mov variant is faster ...

Answered by Martin Rosenau, Oct 13 '22 01:10