Inline assembly that clobbers the red zone

Question

I'm writing a cryptography program, and the core (a wide multiply routine) is written in x86-64 assembly, both for speed and because it extensively uses instructions like adc that are not easily accessible from C. I don't want to inline this function, because it's big and it's called several times in the inner loop.

Ideally I would also like to define a custom calling convention for this function, because internally it uses all the registers (except rsp), doesn't clobber its arguments, and returns in registers. Right now, it's adapted to the C calling convention, but of course this makes it slower (by about 10%).

To avoid this, I can call it with asm("call %Pn" : ... : my_function... : "cc", all the registers); but is there a way to tell GCC that the call instruction messes with the stack? Otherwise GCC will just put all those registers in the red zone, and the top one will get clobbered. I can compile the whole module with -mno-red-zone, but I'd prefer a way to tell GCC that, say, the top 8 bytes of the red zone will be clobbered so that it won't put anything there.

Ben Jackson · Accepted Answer

From your original question I did not realize gcc limited red-zone use to leaf functions. I don't think that's required by the x86_64 ABI, but it is a reasonable simplifying assumption for a compiler. In that case you only need to make the function calling your assembly routine a non-leaf for purposes of compilation:

    int global;
    void other(void);

    void was_leaf(void)
    {
        if (global)
            other();   /* the possible call makes this a non-leaf function */
    }

GCC can't tell whether global will be true, so it can't optimize away the call to other(), and was_leaf() is no longer a leaf function. I compiled this (with more code that triggered stack usage) and observed that as a leaf it did not move %rsp, and with the modification shown it did.

I also tried simply allocating more than 128 bytes (just char buf[150]) in a leaf but I was shocked to see it only did a partial subtraction:

    pushq   %rbp
    movq    %rsp, %rbp
    subq    $40, %rsp
    movb    $7, -155(%rbp)

If I put the leaf-defeating code back in, that becomes subq $160, %rsp.

Peter Cordes · Answer

The max-performance way might be to write the whole inner loop in asm, including the call instructions, if it's really worth it to unroll but not inline. (That's certainly plausible if fully inlining is causing too many uop-cache misses elsewhere.)

Anyway, have C call an asm function containing your optimized loop.

BTW, clobbering all the registers makes it hard for gcc to make a very good loop, so you might well come out ahead from optimizing the whole loop yourself. (e.g. maybe keep a pointer in a register, and an end-pointer in memory, because cmp mem,reg is still fairly efficient).

Have a look at the code gcc/clang wrap around an asm statement that modifies an array element (on Godbolt):

    void testloop(long *p, long count) {
      for (long i = 0 ; i < count ; i++) {
        asm("  #    XXX  asm operand in %0"
        : "+r" (p[i])
        :
        : // "rax",
         "rbx", "rcx", "rdx", "rdi", "rsi", "rbp",
          "r8", "r9", "r10", "r11", "r12","r13","r14","r15"
        );
      }
    }

    #gcc7.2 -O3 -march=haswell

        push registers and other function-intro stuff
        lea     rcx, [rdi+rsi*8]      ; end-pointer
        mov     rax, rdi

        mov     QWORD PTR [rsp-8], rcx    ; store the end-pointer
        mov     QWORD PTR [rsp-16], rdi   ; and the start-pointer

    .L6:
        # rax holds the current-position pointer on loop entry
        # also stored in [rsp-16]
        mov     rdx, QWORD PTR [rax]
        mov     rax, rdx                 # looks like a missed optimization vs. mov rax, [rax], because the asm clobbers rdx

             XXX  asm operand in rax

        mov     rbx, QWORD PTR [rsp-16]   # reload the pointer
        mov     QWORD PTR [rbx], rax
        mov     rax, rbx            # another weird missed-optimization (lea rax, [rbx+8])
        add     rax, 8
        mov     QWORD PTR [rsp-16], rax
        cmp     QWORD PTR [rsp-8], rax
        jne     .L6

      # cleanup omitted.

clang counts a separate counter down towards zero. But it uses load / add -1 / store instead of a memory-destination add [mem], -1 / jnz.

You can probably do better than this if you write the whole loop yourself in asm instead of leaving that part of your hot loop to the compiler.

Consider using some XMM registers for integer arithmetic to reduce register pressure on the integer registers, if possible. On Intel CPUs, moving between GP and XMM registers only costs 1 ALU uop with 1c to 2c latency, depending on the microarchitecture. (It's still 1 uop on AMD, but with higher latency, especially on Bulldozer-family.) Doing scalar integer stuff in XMM registers is not much worse, and could be worth it if total uop throughput is your bottleneck, or if it saves more spill/reloads than it costs.

But of course XMM is not very viable for loop counters (paddd/pcmpeq/pmovmskb/cmp/jcc or psubd/ptest/jcc are not great compared to sub [mem], 1 / jcc), or for pointers, or for extended-precision arithmetic (manually doing carry-out with a compare and carry-in with another paddq sucks even in 32-bit mode where 64-bit integer regs aren't available). It's usually better to spill/reload to memory instead of XMM registers, if you're not bottlenecked on load/store uops.

If you also need calls to the function from outside the loop (cleanup or something), write a wrapper or use add $-128, %rsp ; call ; sub $-128, %rsp to preserve the red-zone in those versions. (Note that -128 is encodable as an imm8 but +128 isn't.)
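As a rough illustration of the add / call / sub sequence, here is a minimal sketch in GNU C. The names cryptofunc_stub and call_past_red_zone are hypothetical, and the stub uses the ordinary C calling convention (argument in rdi, result in rax) only so that the sketch can run; a real hand-written routine would need its own operand and clobber lists. The call goes through a register, which is an assumption made here to avoid depending on how symbol operands behave in position-independent code.

```c
/* Hypothetical stand-in for the hand-written asm routine. It follows the
   normal System V calling convention so this sketch is runnable. */
__attribute__((noinline, used))
static long cryptofunc_stub(long x)
{
    return 2 * x + 1;
}

/* Calls the stub from inline asm after moving rsp below the 128-byte red
   zone, so the return address pushed by `call` cannot overwrite anything
   the compiler stored there. */
long call_past_red_zone(long x)
{
#if defined(__x86_64__) && defined(__GNUC__) && !defined(_WIN32)
    long ret;
    long arg = x;                          /* goes in rdi via the "D" constraint */
    long (*fn)(long) = cryptofunc_stub;    /* target address, held in a register */

    /* Only register operands are used: an rsp-relative memory operand would
       point at the wrong place while rsp is shifted. Stack alignment at the
       call is not adjusted; the callee is assumed not to depend on it. */
    __asm__ volatile(
        "add $-128, %%rsp\n\t"   /* step over the red zone (-128 fits an imm8) */
        "call *%[fn]\n\t"        /* return address is pushed below the red zone */
        "sub $-128, %%rsp"       /* restore rsp */
        : "=a"(ret), "+D"(arg)
        : [fn] "r"(fn)
        : "rcx", "rdx", "rsi", "r8", "r9", "r10", "r11",
          "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7",
          "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14", "xmm15",
          "cc", "memory");
    return ret;
#else
    /* Other targets: plain C call, since this sketch only covers x86-64 System V. */
    return cryptofunc_stub(x);
#endif
}
```

The clobber list here names every call-clobbered register of the standard convention because the stub is ordinary compiled C; for a routine with a custom convention the list would have to match what that routine really touches.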

Including an actual function call in your C function doesn't necessarily make it safe to assume the red-zone is unused, though. Any spill/reload between (compiler-visible) function calls could use the red-zone, so clobbering all the registers in an asm statement is quite likely to trigger that behaviour.

    // a non-leaf function that still uses the red-zone with gcc
    void bar(void) {
      //cryptofunc(1);  // gcc/clang don't use the redzone after this (not future-proof)

      volatile int tmp = 1;
      (void)tmp;
      cryptofunc(1);  // but gcc will use the redzone before a tailcall
    }

    # gcc7.2 -O3 output
        mov     edi, 1
        mov     DWORD PTR [rsp-12], 1
        mov     eax, DWORD PTR [rsp-12]
        jmp     cryptofunc(long)

If you want to depend on compiler-specific behaviour, you could call (with regular C) a non-inline function before the hot loop. With current gcc / clang, that will make them reserve enough stack space since they have to adjust the stack anyway (to align rsp before a call). This is not future-proof at all, but should happen to work.

GNU C has an __attribute__((target("options"))) x86 function attribute, but it's not usable for arbitrary options, and -mno-red-zone is not one of the ones you can toggle on a per-function basis, or with #pragma GCC target ("options") within a compilation unit.

You can use stuff like

    __attribute__(( target("sse4.1,arch=core2") ))
    void penryn_version(void) {
      ...
    }

but not __attribute__(( target("mno-red-zone") )).

There's a #pragma GCC optimize and an optimize function-attribute (both of which are not intended for production code), but #pragma GCC optimize ("-mno-red-zone") doesn't work either. I think the idea is to let some important functions be optimized with -O2 even in debug builds. You can set -f options or -O.
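For reference, a minimal sketch of what the optimize function attribute looks like when it is used for its intended purpose, setting an -O level on one function. The function sum_longs is a hypothetical example, and the attribute is guarded so that compilers other than GCC simply skip it. It does nothing about the red zone.

```c
#include <stddef.h>

/* Hypothetical example function; only the attribute matters here. */
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("O2")))   /* GCC only: request -O2 for this function */
#endif
long sum_longs(const long *p, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}
```

Calling sum_longs on an array holding 1, 2, 3, 4 returns 10 regardless of whether the attribute is honoured; it only influences how the function is compiled.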

You could put the function in a file by itself and compile that compilation unit with -mno-red-zone, though. (And hopefully LTO will not break anything...)

Mike Hamburg
