I'm writing a cryptography program, and the core (a wide multiply routine) is written in x86-64 assembly, both for speed and because it extensively uses instructions like adc that are not easily accessible from C. I don't want to inline this function, because it's big and it's called several times in the inner loop.
Ideally I would also like to define a custom calling convention for this function, because internally it uses all the registers (except rsp), doesn't clobber its arguments, and returns in registers. Right now, it's adapted to the C calling convention, but of course this makes it slower (by about 10%).
To avoid this, I can call it with asm("call %Pn" : ... : my_function... : "cc", all the registers); but is there a way to tell GCC that the call instruction messes with the stack? Otherwise GCC will just put all those registers in the red zone, and the top one will get clobbered. I can compile the whole module with -mno-red-zone, but I'd prefer a way to tell GCC that, say, the top 8 bytes of the red zone will be clobbered so that it won't put anything there.
From your original question I did not realize gcc limited red-zone use to leaf functions. I don't think that's required by the x86_64 ABI, but it is a reasonable simplifying assumption for a compiler. In that case you only need to make the function calling your assembly routine a non-leaf for purposes of compilation:
    void other(void);
    int global;

    void was_leaf(void)
    {
        if (global)
            other();
    }
GCC can't tell if global will be true, so it can't optimize away the call to other(), so was_leaf() is not a leaf function anymore. I compiled this (with more code that triggered stack usage) and observed that as a leaf it did not move %rsp, and with the modification shown it did.
I also tried simply allocating more than 128 bytes (just char buf[150]) in a leaf, but I was shocked to see it only did a partial subtraction:
    pushq %rbp
    movq  %rsp, %rbp
    subq  $40, %rsp
    movb  $7, -155(%rbp)
If I put the leaf-defeating code back in, that becomes subq $160, %rsp.
The max-performance way might be to write the whole inner loop in asm (including the call instructions, if it's really worth it to unroll but not inline; that is certainly plausible if fully inlining is causing too many uop-cache misses elsewhere).
Anyway, have C call an asm function containing your optimized loop.
BTW, clobbering all the registers makes it hard for gcc to make a very good loop, so you might well come out ahead from optimizing the whole loop yourself. (e.g. maybe keep a pointer in a register, and an end-pointer in memory, because cmp mem,reg is still fairly efficient).
Have a look at the code gcc/clang wrap around an asm statement that modifies an array element (on Godbolt):
    void testloop(long *p, long count) {
      for (long i = 0 ; i < count ; i++) {
        asm(" # XXX asm operand in %0"
            : "+r" (p[i])
            :
            : // "rax",
              "rbx", "rcx", "rdx", "rdi", "rsi", "rbp",
              "r8", "r9", "r10", "r11", "r12","r13","r14","r15"
        );
      }
    }

    #gcc7.2 -O3 -march=haswell
        push registers and other function-intro stuff
        lea     rcx, [rdi+rsi*8]          ; end-pointer
        mov     rax, rdi
        mov     QWORD PTR [rsp-8], rcx    ; store the end-pointer
        mov     QWORD PTR [rsp-16], rdi   ; and the start-pointer
    .L6:
        # rax holds the current-position pointer on loop entry
        # also stored in [rsp-16]
        mov     rdx, QWORD PTR [rax]
        mov     rax, rdx                  # looks like a missed optimization vs. mov rax, [rax], because the asm clobbers rdx
          XXX asm operand in rax
        mov     rbx, QWORD PTR [rsp-16]   # reload the pointer
        mov     QWORD PTR [rbx], rax
        mov     rax, rbx                  # another weird missed-optimization (lea rax, [rbx+8])
        add     rax, 8
        mov     QWORD PTR [rsp-16], rax
        cmp     QWORD PTR [rsp-8], rax
        jne     .L6
        # cleanup omitted.
clang counts a separate counter down towards zero. But it uses load / add -1 / store instead of a memory-destination add [mem], -1 / jnz.
You can probably do better than this if you write the whole loop yourself in asm instead of leaving that part of your hot loop to the compiler.
Consider using some XMM registers for integer arithmetic to reduce register pressure on the integer registers, if possible. On Intel CPUs, moving between GP and XMM registers only costs 1 ALU uop with 1c latency. (It's still 1 uop on AMD, but higher latency especially on Bulldozer-family). Doing scalar integer stuff in XMM registers is not much worse, and could be worth it if total uop throughput is your bottleneck, or it saves more spill/reloads than it costs.
But of course XMM is not very viable for loop counters (paddd / pcmpeq / pmovmskb / cmp / jcc or psubd / ptest / jcc are not great compared to sub [mem], 1 / jcc), or for pointers, or for extended-precision arithmetic (manually doing carry-out with a compare and carry-in with another paddq sucks even in 32-bit mode where 64-bit integer regs aren't available). It's usually better to spill/reload to memory instead of XMM registers, if you're not bottlenecked on load/store uops.
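To make the cost of emulated carries concrete, here is a minimal portable C sketch of a 128-bit add built from two 64-bit halves. The u128_parts type and the add128 name are hypothetical, introduced only for illustration. The carry-out has to be recovered with a compare and fed back in with an extra add, which is the same shape of work a paddq-based version needs, whereas the integer pipeline does it with add / adc.

```c
#include <stdint.h>

/* Hypothetical two-limb value, used only for this illustration. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_parts;

/* 128-bit addition without access to the carry flag. */
static u128_parts add128(u128_parts a, u128_parts b)
{
    u128_parts r;
    r.lo = a.lo + b.lo;               /* low limbs, wraps modulo 2^64 */
    uint64_t carry = (r.lo < a.lo);   /* carry-out recovered with a compare */
    r.hi = a.hi + b.hi + carry;       /* carry-in costs an extra add */
    return r;
}
```

A compiler targeting the integer registers will usually turn this pattern back into add / adc; the point is only that a SIMD version has no such shortcut and must perform the compare and the extra add explicitly.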
If you also need calls to the function from outside the loop (cleanup or something), write a wrapper or use add $-128, %rsp ; call ; sub $-128, %rsp to preserve the red-zone in those versions. (Note that -128 is encodable as an imm8 but +128 isn't.)
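A minimal sketch of that add $-128 / call / sub $-128 sequence in GNU C inline asm might look like the following. It assumes an x86-64 ELF target (for example Linux) and AT&T syntax. The routine add_one_custom is a hypothetical stand-in for the real assembly function: it takes its argument in rax, returns in rax, and touches nothing else except flags. On other targets the sketch falls back to plain C so that it still compiles.

```c
#if defined(__x86_64__) && defined(__ELF__)
/* Hypothetical custom-convention routine: in/out in rax, clobbers only flags. */
__asm__(
    ".pushsection .text\n"
    "add_one_custom:\n"
    "    add $1, %rax\n"
    "    ret\n"
    ".popsection\n"
);
#endif

/* Calls the routine while keeping the caller's red zone intact. */
static long call_custom(long v)
{
#if defined(__x86_64__) && defined(__ELF__)
    __asm__ volatile(
        "add $-128, %%rsp\n\t"      /* step below the 128-byte red zone */
        "call add_one_custom\n\t"   /* return address lands below it */
        "sub $-128, %%rsp"          /* restore rsp; -128 fits in an imm8 */
        : "+a"(v)                   /* value goes in and comes back in rax */
        :
        : "cc", "memory");          /* flags change; the call writes the stack */
    return v;
#else
    return v + 1;                   /* plain C fallback for other targets */
#endif
}
```

This sketch does not realign the stack before the call, which is fine only because the hypothetical callee never relies on 16-byte alignment. A real routine would also need every register it modifies listed as an output or clobber.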
Including an actual function call in your C function doesn't necessarily make it safe to assume the red-zone is unused, though. Any spill/reload between (compiler-visible) function calls could use the red-zone, so clobbering all the registers in an asm statement is quite likely to trigger that behaviour.
    // a non-leaf function that still uses the red-zone with gcc
    void bar(void) {
      //cryptofunc(1); // gcc/clang don't use the redzone after this (not future-proof)
      volatile int tmp = 1;
      (void)tmp;
      cryptofunc(1); // but gcc will use the redzone before a tailcall
    }

    # gcc7.2 -O3 output
        mov     edi, 1
        mov     DWORD PTR [rsp-12], 1
        mov     eax, DWORD PTR [rsp-12]
        jmp     cryptofunc(long)
If you want to depend on compiler-specific behaviour, you could call (with regular C) a non-inline function before the hot loop. With current gcc / clang, that will make them reserve enough stack space since they have to adjust the stack anyway (to align rsp before a call). This is not future-proof at all, but should happen to work.
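As a rough sketch of that compiler-specific trick, the following GNU C places a call to a non-inlined function before the loop. The names force_frame, frame_calls and sum_after_call are hypothetical, and the summing loop merely stands in for the real loop containing the asm call. Nothing here is guaranteed by the compiler; it only reproduces the code shape described above.

```c
/* Counts calls so the effect of force_frame() is observable. */
static int frame_calls;

/* Hypothetical helper: kept out of line so the caller contains a real call. */
__attribute__((noinline)) static void force_frame(void)
{
    __asm__ volatile("" ::: "memory");  /* empty asm: a side effect the compiler must keep */
    frame_calls++;
}

/* Stand-in for the function that holds the hot loop. */
static long sum_after_call(const long *p, long n)
{
    force_frame();                  /* compiler-visible call before the hot loop */
    long s = 0;
    for (long i = 0; i < n; i++)
        s += p[i];                  /* the asm call would go here in the real loop */
    return s;
}
```

Whether a given compiler version really stops using the red zone after such a call has to be checked in the generated assembly, for example with -S or on Godbolt.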
GNU C has an __attribute__((target("options"))) x86 function attribute, but it's not usable for arbitrary options, and -mno-red-zone is not one of the ones you can toggle on a per-function basis, or with #pragma GCC target ("options") within a compilation unit.
You can use stuff like
    __attribute__(( target("sse4.1,arch=core2") )) void penryn_version(void) { ... }
but not __attribute__(( target("mno-red-zone") )).
There's a #pragma GCC optimize and an optimize function-attribute (both of which are not intended for production code), but #pragma GCC optimize ("-mno-red-zone") doesn't work either. I think the idea is to let some important functions be optimized with -O2 even in debug builds. You can set -f options or -O.
You could put the function in a file by itself and compile that compilation unit with -mno-red-zone, though. (And hopefully LTO will not break anything...)