
x86 assembly: Pop a value without storing it

Tags: stack, x86, assembly, callstack, stack-pointer

In x86 assembly, is it possible to remove a value from the stack without storing it? Something along the lines of pop word null? I could obviously use add esp,4, but maybe there's a nice and clean CISC mnemonic I'm missing?

224

asked Feb 09 '18 12:02

NeoTheThird

1 Answer

add esp,4 / add rsp,8 is the normal / idiomatic / clean way. No special way is needed because the stack isn't magical or special (at least not in this respect); it's just a pointer in a register with some instructions that use it implicitly. (And for kernel stacks, interrupts use the stack asynchronously, so software couldn't implement a kernel red-zone even if it wanted to...)

Other than that, the magical CISC way to clean up a whole stack frame at the end of a function is leave = mov esp, ebp / pop ebp (or the 16 or 64-bit equivalent). Unlike enter, it's fast enough on modern CPUs to be usable in practice, but it's still a 3-uop instruction on Intel CPUs (see http://agner.org/optimize/). And leave only works if you spent extra instructions making a stack frame with ebp / rbp in the first place. (Usually you wouldn't do that, unless you need to reserve a variable amount of stack space, e.g. with push in a loop to make an array, or the equivalent of a C99 VLA or alloca. Or for beginner code to make access to locals easier, or in 16-bit mode where SP can't be used in addressing modes.)
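As a rough illustration of what leave replaces, here is a minimal 32-bit NASM sketch. The function name with_frame and the 16-byte local area are made up for this example; they are not from the original answer.

```nasm
bits 32
section .text
global with_frame           ; hypothetical function name

with_frame:
    push ebp                ; save the caller's frame pointer
    mov  ebp, esp           ; establish our own frame pointer
    sub  esp, 16            ; reserve 16 bytes of locals (arbitrary size)

    mov  dword [ebp-4], 1   ; access a local relative to EBP

    leave                   ; same effect as: mov esp, ebp / pop ebp
    ret
```

Without the push ebp / mov ebp, esp prologue there is no saved frame pointer for leave to restore, so the epilogue would have to be an add esp, 16 instead.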

The magical CISC way to clean up stack-args is for the callee to use ret imm16 (costing 1 extra uop) to pop the args, creating a calling convention where the callee cleans the stack. In a caller-pops calling convention, there's no way to use this form of ret, but you can simply leave the stack pointer where it is and use mov instead of push to store args for the next function call (if the function needs any stack-args at all; register-arg calling conventions are generally more efficient).
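The two approaches might look roughly like this in 32-bit NASM syntax. All labels are hypothetical, each function takes a single 4-byte stack arg, and the sketch ignores any stack-alignment rules a real ABI may impose.

```nasm
bits 32
section .text
global caller_callee_pops   ; hypothetical names throughout
global caller_reuse_slot

; Callee-pops: the callee removes its own 4-byte arg on return.
callee_pops:
    mov  eax, [esp+4]       ; the arg sits just above the return address
    ret  4                  ; pop the return address, then add 4 to ESP

caller_callee_pops:
    push 1
    call callee_pops        ; ESP is already back to its pre-push value
    push 2
    call callee_pops
    ret

; Caller-pops: the callee leaves its arg on the stack.
callee_plain:
    mov  eax, [esp+4]
    ret

caller_reuse_slot:
    sub  esp, 4             ; reserve one arg slot once
    mov  dword [esp], 1
    call callee_plain
    mov  dword [esp], 2     ; reuse the slot: no add/push between calls
    call callee_plain
    add  esp, 4             ; a single cleanup at the end
    ret
```

The second caller stores the arg again before each call because, in the usual caller-pops conventions, the callee is allowed to overwrite its own arg slots.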

So the magic CISC ways have no performance advantage on modern CPUs, only a minor code-size advantage.

There are 2 reasons you might use pop reg instead of add esp,4:

code-size: pop r32/r64 is a one-byte instruction, vs. 3 bytes for add esp,4 or 4 bytes for add rsp,8.
performance: Intel's stack engine has to insert extra stack-sync uops when you use esp / rsp explicitly after a stack instruction (push/pop/call/ret). So after a call (which returns with a ret), it saves a uop to use pop instead of add esp,4 before you ret at the end of the function.

AMD's stack engine doesn't need extra stack-sync uops, but still makes push/pop single-uop instructions. On older Intel/AMD CPUs, push/pop cost more than plain mov loads/stores: they needed a separate uop for the stack-pointer modification, which also created a data dependency on the stack pointer.
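Side by side, the two ways to drop one stack slot look like this. This is a 32-bit NASM sketch with hypothetical labels, and it assumes ECX is dead at that point (it is call-clobbered in the common 32-bit calling conventions).

```nasm
bits 32
section .text
global discard_with_add     ; hypothetical names
global discard_with_pop

takes_one_arg:              ; stand-in for any caller-pops function
    mov  eax, [esp+4]
    ret

discard_with_add:
    push 42
    call takes_one_arg
    add  esp, 4             ; 83 C4 04 (3 bytes); modifies no other register
    ret

discard_with_pop:
    push 42
    call takes_one_arg
    pop  ecx                ; 59 (1 byte); ECX gets the dead arg value
    ret
```

Both versions leave ESP with the same value before the ret; the only architectural differences are that pop writes ECX and leaves FLAGS alone, while add writes FLAGS.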

See "Why does this function push RAX to the stack as the first operation?" for more details about performance.

If you were looking for aesthetics, well, you can indent, format, and comment your code nicely, but beyond that, you chose the wrong language when you picked x86 asm if aesthetics outweigh optimization.

Of course, if you need to adjust the stack by more than 1 register-width, definitely use add if you don't need the data that pop would load. Or, if you need to adjust it by +128 bytes, use sub esp, -128, because -128 is encodable as a sign-extended imm8, but +128 isn't.
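The imm8 trick can be sketched as follows (32-bit NASM syntax; the label is hypothetical, and the byte counts in the comments assume NASM's default behaviour of picking the shortest encoding):

```nasm
bits 32
section .text
global use_128_bytes        ; hypothetical name

use_128_bytes:
    add  esp, -128          ; reserve 128 bytes: 83 C4 80 (3 bytes, imm8)

    mov  dword [esp], 0     ; use the reserved space

    sub  esp, -128          ; release 128 bytes: 83 EC 80 (3 bytes, imm8)
    ; add esp, 128 would do the same job, but needs an imm32:
    ;                         81 C4 80 00 00 00 (6 bytes)
    ret
```

A sign-extended imm8 covers -128 to +127, so adjustments up to +127 can still use plain add with the short encoding.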

Or maybe use lea esp, [esp+4], like gcc does with -mtune=atom (for in-order Atom, not Silvermont). Like I said, if you wanted clean, you shouldn't have picked x86 asm.

You can almost always find a dead register to pop into. If you need to adjust E/RSP by one stack slot before popping some registers you actually wanted to pop, you can always pop the same register twice.
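Popping the same register twice might look like this (32-bit NASM sketch; the label and the scratch slot are invented for illustration):

```nasm
bits 32
section .text
global pop_twice            ; hypothetical name

pop_twice:
    push ebx                ; save a call-preserved register
    push 0                  ; one extra slot (stands in for a temporary)

    mov  ebx, [esp]         ; body is free to use EBX and the slot

    pop  ebx                ; drops the extra slot; this value is dead
    pop  ebx                ; restores the caller's EBX
    ret
```

The first pop only exists to move ESP; whatever it loads is overwritten by the second pop.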

In the extremely rare case where none of the 7 (x86-32) or 15 (x86-64) non-stack registers are available as pop destinations, this optimization is not available and you should simply use the traditional add. It's not worth spending extra instructions to make it possible to pop; that would outweigh the minor benefit of using pop.

Note that pop Sreg (segment register) still consumes the regular "stack width" (32 or 64 bits, depending on mode), rather than only 16 bits for a 16-bit register. But only pop ds/es/ss are single-byte, and those encodings aren't valid in 64-bit mode; pop fs/gs are 2 bytes each. So if you're optimizing for code-size, pop gs is 1 byte smaller than add esp,4 (or 2 bytes smaller than add rsp,8), but much, much slower.

175

answered Nov 22 '22 04:11

Peter Cordes
