I have recently been doing a lot of x64 assembly programming (on Linux) for integration with my C/C++ programs.
Since I am mostly concerned about efficiency, I like to use as few different registers and memory addresses as possible, and I try not to create stack frames or preserve registers (every cycle counts).
According to cdecl, the r10 and r11 registers are not preserved, and I wish to use them as temporary variables in my functions, preferably without preserving them. Does this cause any incompatibility issues or bugs with any compiler? (I haven't experienced any so far, but it is a concern.)
The x86-64 System V ABI doesn't call its calling convention "cdecl". It's just the x86-64 SysV calling convention. The string "cdecl" doesn't appear in the ABI doc.
r11 is a temporary, aka call-clobbered register.
r10 is also a call-clobbered register. The ABI says "used for passing a function’s static chain pointer", but C doesn't use this, and code generated by gcc and clang freely uses r10 without saving/restoring it. The ABI's table of register usage lists r10 as not preserved across function calls, so a leaf function can always clobber it. (See also: Which registers to use as temporaries when writing AMD64 SysV assembly?)
gcc does use r10 as part of its trampoline for function pointers to GNU C nested functions, to hold a pointer to the stack frame of the outer scope. The trampoline of machine code on the stack is a hack, but this is indeed a static chain pointer; languages with proper support for nested functions would probably make the caller aware of it (like a lambda / closure) and have it pass a value in r10 when calling through a pointer to a nested function.
Non-leaf functions do not need to pass on their incoming r10 to their children unless they're "nested functions" in a language that supports that sort of thing (not C or C++). Therefore r10 is also a pure temporary in normal circumstances.
r10 and r11 are not arg-passing registers, unlike the other call-clobbered registers, so "wrapper" functions can use them (especially r11) without saving/restoring anything.
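As a rough illustration of the point above, here is a minimal sketch of a leaf function written in assembly that uses r10 and r11 as scratch without saving them, called from C. The function name scratch_demo and the arithmetic it does are made up for the example. It assumes GCC or clang on x86-64 Linux (file-scope asm, GAS directives, Intel syntax); on anything else it falls back to plain C.

```c
#include <stdint.h>

/* Hypothetical leaf function for illustration: returns 3*x + y. */
uint64_t scratch_demo(uint64_t x, uint64_t y);

#if defined(__x86_64__) && defined(__linux__) && defined(__GNUC__)
__asm__(
    ".pushsection .text\n"
    ".intel_syntax noprefix\n"
    ".globl scratch_demo\n"
    ".type scratch_demo, @function\n"
    "scratch_demo:\n"
    "    mov  r10, rdi\n"              /* r10 = x (first integer arg)     */
    "    lea  r11, [r10 + r10*2]\n"    /* r11 = 3*x                       */
    "    lea  rax, [r11 + rsi]\n"      /* return value = 3*x + y          */
    "    ret\n"                        /* r10/r11 left clobbered: allowed */
    ".size scratch_demo, .-scratch_demo\n"
    ".att_syntax prefix\n"
    ".popsection\n"
);
#else
/* Fallback for other targets so the example still compiles. */
uint64_t scratch_demo(uint64_t x, uint64_t y)
{
    return 3 * x + y;
}
#endif
```

The C caller needs nothing special: the compiler already assumes r10 and r11 do not survive any call, so it never keeps a live value in them across the call to scratch_demo.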
In a normal function, RBX, RBP, and RSP are call-preserved, along with R12..R15. All others can be clobbered without saving/restoring. (That includes xmm/ymm0..15 and zmm0..31, and the x87 stack, and the condition codes in RFLAGS).
Note that r8..15 need a REX prefix, even with 32-bit operand-size (like xor r10d, r10d). If you have some 64-bit non-pointer integers, then sure, keep them in r8..r11, because you always need a REX prefix for 64-bit operand-size any time you use those values anyway.
Smaller code-size is usually not worse, and sometimes helps with decode, uop-cache density, and L1i cache density. RAX, RCX, RDX, RSI, and RDI should be your first choices for scratch regs. (And use 32-bit operand-size unless you need 64-bit, e.g. xor eax,eax is the correct way to zero RAX. Silvermont doesn't recognize xor r10,r10 as a zeroing idiom, so use xor r10d,r10d even though it doesn't save code size.)
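To see that the 32-bit form really zeroes the whole register (writing a 32-bit register zero-extends into the full 64-bit register), here is a small sketch under the same assumptions as before: GCC or clang on x86-64 Linux, with a plain-C fallback elsewhere. The name zero_r10_demo is hypothetical.

```c
#include <stdint.h>

/* Hypothetical demo: fills r10 with all-ones, zeroes it with the
   32-bit xor idiom, and returns the full 64-bit r10. */
uint64_t zero_r10_demo(void);

#if defined(__x86_64__) && defined(__linux__) && defined(__GNUC__)
__asm__(
    ".pushsection .text\n"
    ".intel_syntax noprefix\n"
    ".globl zero_r10_demo\n"
    ".type zero_r10_demo, @function\n"
    "zero_r10_demo:\n"
    "    mov  r10, -1\n"        /* set all 64 bits of r10            */
    "    xor  r10d, r10d\n"     /* 32-bit xor: upper half zeroed too */
    "    mov  rax, r10\n"       /* return the whole 64-bit register  */
    "    ret\n"
    ".size zero_r10_demo, .-zero_r10_demo\n"
    ".att_syntax prefix\n"
    ".popsection\n"
);
#else
/* Fallback for other targets: nothing to demonstrate, just return 0. */
uint64_t zero_r10_demo(void)
{
    return 0;
}
#endif
```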
If you do run out of low registers, ideally use r10 / r11 for things that will normally be used with 64-bit operand-size (or VEX prefixes) anyway, e.g. pointers to 64-bit data or pointers to pointers. mov eax, [r10] needs a REX prefix while mov eax, [rdi] doesn't. But mov rax, [rdi] and mov r8, [r10] are the same size.
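The size differences above can be checked against the machine-code bytes. The sketch below simply writes out the encodings as C byte arrays (the array names are made up for the example); it doesn't execute them, so it compiles on any target. You can confirm the bytes with your own assembler and objdump -d.

```c
/* x86-64 encodings of the loads discussed above.
   Opcode 8B /r is "mov r, r/m"; the leading 4x byte, where present, is REX. */

/* mov eax, [rdi]  : no prefix needed                   */
const unsigned char mov_eax_rdi[] = { 0x8B, 0x07 };

/* mov eax, [r10]  : REX.B selects r10 as the base      */
const unsigned char mov_eax_r10[] = { 0x41, 0x8B, 0x02 };

/* mov rax, [rdi]  : REX.W for 64-bit operand-size      */
const unsigned char mov_rax_rdi[] = { 0x48, 0x8B, 0x07 };

/* mov r8, [r10]   : REX.W + REX.R + REX.B, one prefix  */
const unsigned char mov_r8_r10[]  = { 0x4D, 0x8B, 0x02 };
```

Since an instruction carries at most one REX prefix, using r8..r15 costs nothing extra in an instruction that already needs REX.W.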
It's hard to gain much because you often need to use different values together in different combinations, like eventually using cmp eax, r10d or whatever, but if you want to go all-out on optimizing, then think about code-size. Maybe also think about where the instruction boundaries are and how the code will fit into the uop cache.
See the x86 tag wiki, and especially http://agner.org/optimize/ for tips on writing efficient code.
You can use r10 and r11 as freely as rcx and rdx.