I am trying to write inline x86-64 assembly for GCC to efficiently use the MULQ instruction. MULQ multiplies the 64-bit register RAX with another 64-bit value. The other value can be any 64-bit register (even RAX) or a value in memory. MULQ puts the high 64 bits of the product into RDX and the low 64 bits into RAX.
Now, it's easy enough to express a correct mulq as inline assembly:
#include <stdint.h>
static inline void mulq(uint64_t *high, uint64_t *low, uint64_t x, uint64_t y)
{
asm ("mulq %[y]"
: "=d" (*high), "=a" (*low)
: "a" (x), [y] "rm" (y)
);
}
This code is correct, but not optimal. MULQ is commutative, so if y
happened to be in RAX already, then it would be correct to leave y
where it is and do the multiply. But GCC doesn't know that, so it will emit extra instructions to move the operands into their pre-defined places. I want to tell GCC that it can put either input in either location, as long as one ends up in RAX and the MULQ references the other location. GCC has a syntax for this, called "multiple alternative constraints". Notice the commas (but the overall asm() is broken; see below):
asm ("mulq %[y]"
: "=d,d" (*high), "=a,a" (*low)
: "a,rm" (x), [y] "rm,a" (y)
);
Unfortunately, this is wrong. If GCC chooses the second alternative constraint, it will emit "mulq %rax". To be clear, consider this function:
uint64_t f()
{
uint64_t high, low;
uint64_t rax;
asm("or %0,%0": "=a" (rax));
mulq(&high, &low, 7, rax);
return high;
}
Compiled with gcc -O3 -c -fkeep-inline-functions mulq.c
, GCC emits this assembly:
0000000000000010 <f>:
10: or %rax,%rax
13: mov $0x7,%edx
18: mul %rax
1b: mov %rdx,%rax
1e: retq
The "mul %rax" should be "mul %rdx".
How can this inline asm be rewritten so it generates the correct output in every case?
This 2012's question is still very relevant in 2019. Although gcc has changed and some code generated was not optimal back in 2012 but is now, the other way around also holds.
Inspired by Whitlock's analysis, I've tested mulq
in 9 different cases where each of x
and y
is either a constant (5
, 6
) or a value in memory (bar
, zar
) or a value in rax
(f1()
, f2()
):
uint64_t h1() { uint64_t h, l; mulq(&h, &l, 5, 6); return h + l; }
uint64_t h2() { uint64_t h, l; mulq(&h, &l, 5, bar); return h + l; }
uint64_t h3() { uint64_t h, l; mulq(&h, &l, 5, f1()); return h + l; }
uint64_t h4() { uint64_t h, l; mulq(&h, &l, bar, 5); return h + l; }
uint64_t h5() { uint64_t h, l; mulq(&h, &l, bar, zar); return h + l; }
uint64_t h6() { uint64_t h, l; mulq(&h, &l, bar, f1()); return h + l; }
uint64_t h7() { uint64_t h, l; mulq(&h, &l, f1(), 5); return h + l; }
uint64_t h8() { uint64_t h, l; mulq(&h, &l, f1(), bar); return h + l; }
uint64_t h9() { uint64_t h, l; mulq(&h, &l, f1(), f2()); return h + l; }
I've tested 5 implementations: Staufk, Whitlock, Hale, Burdo and my own:
inline void mulq(uint64_t *high, uint64_t *low, uint64_t x, uint64_t y) {
asm("mulq %[y]" : [a]"=a,a"(*low), "=d,d"(*high) : "%a,rm"(x), [y]"rm,a"(y) : "cc");
}
All implementation are still unable to produce optimal code in all cases. Whilst other's fail to produce optimal code for h3,
h4
and h6
, Whitlock's and mine fail only for h3
:
h3():
callq 4004d0 <f1()>
mov %rax,%r8
mov $0x5,%eax
mul %r8
add %rdx,%rax
retq
Everything else being equal, one can see that mine is simpler than Whitlock's. With an extra level of indirection and using a gcc's built in function (also available in clang but I haven't tested) it's possible to get optimal h3
by calling this function instead of mulq
:
inline void mulq_fixed(uint64_t* high, uint64_t* low, uint64_t x, uint64_t y) {
if (__builtin_constant_p(x))
mulq(high, low, y, x);
else
mulq(high, low, x, y);
}
yields:
h3():
callq 4004d0 <f1()>
mov $0x5,%edx
mul %rdx
add %rdx,%rax
retq
The idea of using __builtin_constant_p
was actually taken from gcc's doc:
There is no way within the template to determine which alternative was chosen. However you may be able to wrap your asm statements with builtins such as __builtin_constant_p to achieve the desired results.
See for yourself in Compiler Explorer.
Note: There's another smaller and unexpected disadvantage of Whitlock's implementation. You need to check option 11010 in Compiler Explorer otherwise the output is misleading and functions h1
, ..., h9
appear to use instruction mulq
twice. This is because Compiler Explorer's parser does not process the assembler directives .ifnc
/.else
/.endif
properly and simply remove them, showing both possible paths (the .if
's and the .else
's). Alternatively, you can uncheck option .text.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With