I have the following C/C++ function:
unsigned div3(unsigned x) { return x / 3; }
When compiled using clang 10 at -O3
, this results in:
div3(unsigned int): mov ecx, edi # tmp = x mov eax, 2863311531 # result = 3^-1 imul rax, rcx # result *= tmp shr rax, 33 # result >>= 33 ret
What I do understand is: division by 3 is equivalent to multiplying with the multiplicative inverse 3-1 mod 232 which is 2863311531.
There are some things that I don't understand though:
ecx
/rcx
at all? Can't we multiply rax
with edi
directly?eax
and ecx
?imul
instead of mul
? I thought modular arithmetic would be all unsigned.For those who don't understand what I mean by 3-1 mod 232, I am talking about the multiplicative inverse here. For example:
// multiplying with inverse of 3: 15 * 2863311531 = 42949672965 42949672965 mod 2^32 = 5 // using fixed-point multiplication 15 * 2863311531 = 42949672965 42949672965 >> 33 = 5 // simply dividing by 3 15 / 3 = 5
So multiplying with 42949672965 is actually equivalent to dividing by 3. I assumed clang's optimization is based on modular arithmetic, when it's really based on fixed point arithmetic.
I have now realized that the multiplicative inverse can only be used for divisions without a remainder. For example, multiplying 1 times 3-1 is equal to 3-1, not zero. Only fixed point arithmetic has correct rounding.
Unfortunately, clang does not make any use of modular arithmetic which would just be a single imul
instruction in this case, even when it could. The following function has the same compile output as above.
unsigned div3(unsigned x) { __builtin_assume(x % 3 == 0); return x / 3; }
(Canonical Q&A about fixed-point multiplicative inverses for exact division that work for every possible input: Why does GCC use multiplication by a strange number in implementing integer division? - not quite a duplicate because it only covers the math, not some of the implementation details like register width and imul vs. mul.)
Division. Division is between two operands of arithmetic type. Overflow can occur during two's complement signed integer division when the dividend is equal to the minimum (negative) value for the signed integer type and the divisor is equal to −1 . Division operations are also susceptible to divide-by-zero errors.
The "Divide-by-zero" exception is for dividing by zero with the div instruction.
- Can't we multiply rax with edi directly?
We can't imul rax, rdi
because the calling convention allows the caller to leave garbage in the high bits of RDI; only the EDI part contains the value. This is a non-issue when inlining; writing a 32-bit register does implicitly zero-extend to the full 64-bit register, so the compiler will usually not need an extra instruction to zero-extend a 32-bit value.
(zero-extending into a different register is better because of limitations on mov-elimination, if you can't avoid it).
Taking your question even more literally, no, x86 doesn't have any multiply instructions that zero-extend one of their inputs to let you multiply a 32-bit and a 64-bit register. Both inputs must be the same width.
- Why do we multiply in 64-bit mode?
(terminology: all of this code runs in 64-bit mode. You're asking why 64-bit operand-size.)
You could mul edi
to multiply EAX with EDI to get a 64-bit result split across EDX:EAX, but mul edi
is 3 uops on Intel CPUs, vs. most modern x86-64 CPUs having fast 64-bit imul
. (Although imul r64, r64
is slower on AMD Bulldozer-family, and on some low-power CPUs.) https://uops.info/ and https://agner.org/optimize/ (instruction tables and microarch PDF) (Fun fact: mul rdi
is actually cheaper on Intel CPUs, only 2 uops. Perhaps something to do with not having to do extra splitting on the output of the integer multiply unit, like mul edi
would have to split the 64-bit low half multiplier output into EDX and EAX halves, but that happens naturally for 64x64 => 128-bit mul.)
Also the part you want is in EDX so you'd need another mov eax, edx
to deal with it. (Again, because we're looking at code for a stand-alone definition of the function, not after inlining into a caller.)
GCC 8.3 and earlier did use 32-bit mul
instead of 64-bit imul
(https://godbolt.org/z/5qj7d5). That was not crazy for -mtune=generic
when Bulldozer-family and old Silvermont CPUs were more relevant, but those CPUs are farther in the past for more recent GCC, and its generic tuning choices reflect that. Unfortunately GCC also wasted a mov
instruction copying EDI to EAX, making this way look even worse :/
# gcc8.3 -O3 (default -mtune=generic) div3(unsigned int): mov eax, edi # 1 uop, stupid wasted instruction mov edx, -1431655765 # 1 uop (same 32-bit constant, just printed differently) mul edx # 3 uops on Sandybridge-family mov eax, edx # 1 uop shr eax # 1 uop ret # total of 7 uops on SnB-family
Would only be 6 uops with mov eax, 0xAAAAAAAB
/ mul edi
, but still worse than:
# gcc9.3 -O3 (default -mtune=generic) div3(unsigned int): mov eax, edi # 1 uop mov edi, 2863311531 # 1 uop imul rax, rdi # 1 uop shr rax, 33 # 1 uop ret # total 4 uops, not counting ret
Unfortunately, 64-bit 0x00000000AAAAAAAB
can't be represented as a 32-bit sign-extended immediate, so imul rax, rcx, 0xAAAAAAAB
isn't encodeable. It would mean 0xFFFFFFFFAAAAAAAB
.
- Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
It is unsigned. Signedness of the inputs only affects the high half of the result, but imul reg, reg
doesn't produce the high half. Only the one-operand forms of mul
and imul
are full multiplies that do NxN => 2N, so only they need separate signed and unsigned versions.
Only imul
has the faster and more flexible low-half-only forms. The only thing that's signed about imul reg, reg
is that it sets OF based on signed overflow of the low half. It wasn't worth spending more opcodes and more transistors just to have a mul r,r
whose only difference from imul r,r
is the FLAGS output.
Intel's manual (https://www.felixcloutier.com/x86/imul) even points out the fact that it can be used for unsigned.
- What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.
No, there's no multiplier constant that would give the exact right answer for every possible input x
if you implemented it that way. The "as-if" optimization rule doesn't allow approximations, only implementations that produce the exact same observable behaviour for every input the program uses. Without knowing a value-range for x
other than full range of unsigned
, compilers don't have that option. (-ffast-math
only applies to floating point; if you want faster approximations for integer math, code them manually like below):
See Why does GCC use multiplication by a strange number in implementing integer division? for more about the fixed-point multiplicative inverse method compilers use for exact division by compile time constants.
For an example of this not working in the general case, see my edit to an answer on Divide by 10 using bit shifts? which proposed
// Warning: INEXACT FOR LARGE INPUTS // this fast approximation can just use the high half, // so on 32-bit machines it avoids one shift instruction vs. exact division int32_t div10(int32_t dividend) { int64_t invDivisor = 0x1999999A; return (int32_t) ((invDivisor * dividend) >> 32); }
Its first wrong answer (if you loop from 0 upward) is div10(1073741829) = 107374183
when 1073741829/10
is actually 107374182. (It rounded up instead of toward 0 like C integer division is supposed to.)
From your edit, I see you were actually talking about using the low half of a multiply result, which apparently works perfectly for exact multiples all the way up to UINT_MAX.
As you say, it completely fails when the division would have a remainder, e.g. 16 * 0xaaaaaaab
= 0xaaaaaab0
when truncated to 32-bit, not 5
.
unsigned div3_exact_only(unsigned x) { __builtin_assume(x % 3 == 0); // or an equivalent with if() __builtin_unreachable() return x / 3; }
Yes, if that math works out, it would be legal and optimal for compilers to implement that with 32-bit imul. They don't look for this optimization because it's rarely a known fact. IDK if it would be worth adding compiler code to even look for the optimization, in terms of compile time, not to mention compiler maintenance cost in developer time. It's not a huge difference in runtime cost, and it's rarely going to be possible. It is nice, though.
div3_exact_only: imul eax, edi, 0xAAAAAAAB # 1 uop, 3c latency ret
However, it is something you can do yourself in source code, at least for known type widths like uint32_t
:
uint32_t div3_exact_only(uint32_t x) { return x * 0xaaaaaaabU; }
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With