Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can GCC emit different instruction mnemonics when choosing between multiple alternative operand constraints of inline assembly?

I am trying to write inline x86-64 assembly for GCC to efficiently use the MULQ instruction. MULQ multiplies the 64-bit register RAX with another 64-bit value. The other value can be any 64-bit register (even RAX) or a value in memory. MULQ puts the high 64 bits of the product into RDX and the low 64 bits into RAX.

Now, it's easy enough to express a correct mulq as inline assembly:

#include <stdint.h>
static inline void mulq(uint64_t *high, uint64_t *low, uint64_t x, uint64_t y)
{
    asm ("mulq %[y]" 
          : "=d" (*high), "=a" (*low)
          : "a" (x), [y] "rm" (y)    
        );
}

This code is correct, but not optimal. MULQ is commutative, so if y happened to be in RAX already, then it would be correct to leave y where it is and do the multiply. But GCC doesn't know that, so it will emit extra instructions to move the operands into their pre-defined places. I want to tell GCC that it can put either input in either location, as long as one ends up in RAX and the MULQ references the other location. GCC has a syntax for this, called "multiple alternative constraints". Notice the commas (but the overall asm() is broken; see below):

asm ("mulq %[y]" 
      : "=d,d" (*high), "=a,a" (*low)
      : "a,rm" (x), [y] "rm,a" (y)    
    );

Unfortunately, this is wrong. If GCC chooses the second alternative constraint, it will emit "mulq %rax". To be clear, consider this function:

uint64_t f()
{
    uint64_t high, low;
    uint64_t rax;
    asm("or %0,%0": "=a" (rax));
    mulq(&high, &low, 7, rax);
    return high;
}

Compiled with gcc -O3 -c -fkeep-inline-functions mulq.c, GCC emits this assembly:

0000000000000010 <f>:
  10:   or     %rax,%rax
  13:   mov    $0x7,%edx
  18:   mul    %rax
  1b:   mov    %rdx,%rax
  1e:   retq

The "mul %rax" should be "mul %rdx".

How can this inline asm be rewritten so it generates the correct output in every case?

like image 986
staufk Avatar asked Nov 29 '12 02:11

staufk


1 Answers

This 2012's question is still very relevant in 2019. Although gcc has changed and some code generated was not optimal back in 2012 but is now, the other way around also holds.

Inspired by Whitlock's analysis, I've tested mulq in 9 different cases where each of x and y is either a constant (5, 6) or a value in memory (bar, zar) or a value in rax (f1(), f2()):

uint64_t h1() { uint64_t h, l; mulq(&h, &l,    5,    6); return h + l; }
uint64_t h2() { uint64_t h, l; mulq(&h, &l,    5,  bar); return h + l; }
uint64_t h3() { uint64_t h, l; mulq(&h, &l,    5, f1()); return h + l; }
uint64_t h4() { uint64_t h, l; mulq(&h, &l,  bar,    5); return h + l; }
uint64_t h5() { uint64_t h, l; mulq(&h, &l,  bar,  zar); return h + l; }
uint64_t h6() { uint64_t h, l; mulq(&h, &l,  bar, f1()); return h + l; }
uint64_t h7() { uint64_t h, l; mulq(&h, &l, f1(),    5); return h + l; }
uint64_t h8() { uint64_t h, l; mulq(&h, &l, f1(),  bar); return h + l; }
uint64_t h9() { uint64_t h, l; mulq(&h, &l, f1(), f2()); return h + l; }

I've tested 5 implementations: Staufk, Whitlock, Hale, Burdo and my own:

inline void mulq(uint64_t *high, uint64_t *low, uint64_t x, uint64_t y) {
    asm("mulq %[y]" : [a]"=a,a"(*low), "=d,d"(*high) : "%a,rm"(x), [y]"rm,a"(y) : "cc");
}

All implementation are still unable to produce optimal code in all cases. Whilst other's fail to produce optimal code for h3, h4 and h6, Whitlock's and mine fail only for h3:

h3():
 callq 4004d0 <f1()>
 mov %rax,%r8
 mov $0x5,%eax
 mul %r8
 add %rdx,%rax
 retq 

Everything else being equal, one can see that mine is simpler than Whitlock's. With an extra level of indirection and using a gcc's built in function (also available in clang but I haven't tested) it's possible to get optimal h3 by calling this function instead of mulq:

inline void mulq_fixed(uint64_t* high, uint64_t* low, uint64_t x, uint64_t y) {
    if (__builtin_constant_p(x))
        mulq(high, low, y, x);
    else
        mulq(high, low, x, y);
}

yields:

h3():
 callq 4004d0 <f1()>
 mov $0x5,%edx
 mul %rdx
 add %rdx,%rax
 retq 

The idea of using __builtin_constant_p was actually taken from gcc's doc:

There is no way within the template to determine which alternative was chosen. However you may be able to wrap your asm statements with builtins such as __builtin_constant_p to achieve the desired results.

See for yourself in Compiler Explorer.

Note: There's another smaller and unexpected disadvantage of Whitlock's implementation. You need to check option 11010 in Compiler Explorer otherwise the output is misleading and functions h1, ..., h9 appear to use instruction mulq twice. This is because Compiler Explorer's parser does not process the assembler directives .ifnc/.else/.endif properly and simply remove them, showing both possible paths (the .if's and the .else's). Alternatively, you can uncheck option .text.

like image 62
Cassio Neri Avatar answered Oct 03 '22 02:10

Cassio Neri