Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does this 128 bit integer multiplication work in assembly (x86-64)?

I'm reading Computer Systems: A Programmer's Perspective and the homework was to describe how this algorithm works.

C function:

void store_prod(__int128 *dest, int64_t x, int64_t y) {
    *dest = x * (__int128)y;
}

Assembly:

movq %rdx, %rax
cqto
movq  %rsi, %rcx
sarq  $63,  %rcx
imulq %rax, %rcx
imulq %rsi, %rdx
addq  %rdx, %rcx
mulq  %rsi
addq  %rcx, %rdx
movq  %rax, (%rdi)
movq  %rdx, 8(%rdi)
ret

I don't know why it performs: xh * yl + yh * xl = value which we add after unsigned multiplication

like image 329
denis631 Avatar asked Nov 18 '15 19:11

denis631


2 Answers

As always, compiler options matter. That source code with gcc -Og (optimize for debugging) produces very similar asm to your listing (the cast sign-extends both operands to 128-bit before doing a full 128x128 => 128-bit multiply). This is a naive implementation of exactly what the C standard says should happen (integer precedence rules for converting both operands to the same type).

If you're going to talk about compiler output, you should always say which version of which compiler with what options. Or just post a link to it on godbolt, like the one above. (Edit: oops, source and asm were from a book that didn't give that info. And if that's the global edition of CS:APP 3e, beware that the practice problems are filled with errors in the global edition.)

With gcc -O3 or -O2, GCC takes advantage of the fact that both operands are still really only 64bit, so a single imul is enough. (This still produces the same result for every input, and thus still implements the C logic, per the as-if rule. C doesn't have widening operations so you're forced to write the source in an "inefficient" way that depends on the compiler to transform it into efficient asm.)


The sar $63, %rcx is part of sign-extending rsi into rcx:rsi, just like cqto sign-extends rax into rdx:rax. It replaces every bit of RCX with a copy of the original sign bit.


Most of this answer was already given by other people in comments, but I don't think anyone else noticed that gcc -Og / -O1 gives almost exactly that asm output.

like image 72
Peter Cordes Avatar answered Oct 01 '22 11:10

Peter Cordes


What GCC is doing is using the property that signed multiplication can be done using the following formula.

(hi,lo) = unsigned(x*y)
hi -= ((x<0) ? y : 0)  + ((y<0) ? x : 0)

Despite the fact that there is no need to do this since in this case the x86-64 instruction set has a signed 64-bit*64-bit to 128-bit instruction (imul with one operand) this formula is useful in other cases. For example for implementing signed 128-bit multiplication with SSE2/AVX2/AVX512 or for implementing 256-bit multiplication when the instruction set only does 128-bit multiplication (such as with x86-64).

GCC implemented this formula a bit differently though. If we take the sign bit and extend it to the whole word, call this function sign_ext, then the function returns -1 or 0. Then what GCC did is:

hi += sign_ext(x)*y + sign_ext(y)*x

for example sign_ext(x)*y in pseudo-instructions for 64-bit words is

sarq  $63, x    ; sign_ext(x)
imulq   y, x    ; sign_ext(x)*y

So now you ask (or meant to ask):

Why is this formula true?

That's a good qeustion. I asked this same question as well and njuffa wrote

@Zboson: It follows directly from two's complement complement representation. E.g. 32-bit integers -n and -m are represented as unsigned numbers x=2**32-n, y=2**32-m. If you multiply those you have x*y = 2**64 - 2**32*n - 2**32*m + n*m. The middle terms indicate the necessary corrections to the upper half of the product. Working through a simple example using -1*-1 should prove very instructive.

like image 41
Z boson Avatar answered Oct 01 '22 10:10

Z boson