I'm reading Computer Systems: A Programmer's Perspective and the homework was to describe how this algorithm works.
C function:
void store_prod(__int128 *dest, int64_t x, int64_t y) {
*dest = x * (__int128)y;
}
Assembly:
movq %rdx, %rax
cqto
movq %rsi, %rcx
sarq $63, %rcx
imulq %rax, %rcx
imulq %rsi, %rdx
addq %rdx, %rcx
mulq %rsi
addq %rcx, %rdx
movq %rax, (%rdi)
movq %rdx, 8(%rdi)
ret
I don't know why it performs: xh * yl + yh * xl = value which we add after unsigned multiplication
As always, compiler options matter. That source code with gcc -Og
(optimize for debugging) produces very similar asm to your listing (the cast sign-extends both operands to 128-bit before doing a full 128x128 => 128-bit multiply). This is a naive implementation of exactly what the C standard says should happen (integer precedence rules for converting both operands to the same type).
If you're going to talk about compiler output, you should always say which version of which compiler with what options. Or just post a link to it on godbolt, like the one above. (Edit: oops, source and asm were from a book that didn't give that info. And if that's the global edition of CS:APP 3e, beware that the practice problems are filled with errors in the global edition.)
With gcc -O3
or -O2
, GCC takes advantage of the fact that both operands are still really only 64bit, so a single imul
is enough. (This still produces the same result for every input, and thus still implements the C logic, per the as-if rule. C doesn't have widening operations so you're forced to write the source in an "inefficient" way that depends on the compiler to transform it into efficient asm.)
The sar $63, %rcx
is part of sign-extending rsi
into rcx:rsi
, just like cqto
sign-extends rax
into rdx:rax
. It replaces every bit of RCX with a copy of the original sign bit.
Most of this answer was already given by other people in comments, but I don't think anyone else noticed that gcc -Og
/ -O1
gives almost exactly that asm output.
What GCC is doing is using the property that signed multiplication can be done using the following formula.
(hi,lo) = unsigned(x*y)
hi -= ((x<0) ? y : 0) + ((y<0) ? x : 0)
Despite the fact that there is no need to do this since in this case the x86-64 instruction set has a signed 64-bit*64-bit to 128-bit instruction (imul
with one operand) this formula is useful in other cases. For example for implementing signed 128-bit multiplication with SSE2/AVX2/AVX512 or for implementing 256-bit multiplication when the instruction set only does 128-bit multiplication (such as with x86-64).
GCC implemented this formula a bit differently though. If we take the sign bit and extend it to the whole word, call this function sign_ext
, then the function returns -1
or 0
. Then what GCC did is:
hi += sign_ext(x)*y + sign_ext(y)*x
for example sign_ext(x)*y
in pseudo-instructions for 64-bit words is
sarq $63, x ; sign_ext(x)
imulq y, x ; sign_ext(x)*y
So now you ask (or meant to ask):
Why is this formula true?
That's a good qeustion. I asked this same question as well and njuffa wrote
@Zboson: It follows directly from two's complement complement representation. E.g. 32-bit integers
-n
and-m
are represented as unsigned numbersx=2**32-n, y=2**32-m
. If you multiply those you havex*y = 2**64 - 2**32*n - 2**32*m + n*m
. The middle terms indicate the necessary corrections to the upper half of the product. Working through a simple example using -1*-1 should prove very instructive.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With