Consider this function: <pre class="prettyprint"><code>unsigned long f(unsigned long x) { return x / 7; } </code></pre> With <code>-O3</code>, Clang turns the division into a multiplication, as expected: <pre class="prettyprint"><code>f: # @f movabs rcx, 2635249153387078803 mov rax, rdi mul rcx sub rdi, rdx shr rdi lea rax, [rdi + rdx] shr rax, 2 ret </code></pre> GCC does basically the same thing, except for using <code>rdx</code> where Clang uses <code>rcx</code>. But they both appear to be doing an extra move. Why not this instead? <pre class="prettyprint"><code>f: movabs rax, 2635249153387078803 mul rdi sub rdi, rdx shr rdi lea rax, [rdi + rdx] shr rax, 2 ret </code></pre> In particular, they both put the numerator in <code>rax</code>, but by putting the magic number there instead, you avoid having to move the numerator at all. If this is actually better, I'm surprised that neither GCC nor Clang do it this way, since it feels so obvious. Is there some microarchitectural reason that their way is actually faster than my way? Godbolt link.

Visual Studio 2015 generates the code you expected, rcx = input dividend: <pre class="prettyprint"><code> mov rax, 2635249153387078803 mul rcx sub rcx, rdx shr rcx, 1 lea rax, QWORD PTR [rdx+rcx] shr rax, 2 </code></pre> <hr> A divisor of 7 needs a 65 bit multiplier to get the proper accuracy. <pre class="prettyprint"><code>floor((2^(64+ceil(log2(7))))/7)+1 = floor((2^67)/7)+1 = 21081993227096630419 </code></pre> Removing the most significant bit, 2^64, results in 21081993227096630419 - 2^64 = 2635249153387078803, which is the multiplier actually used in the code. The generated code compensates for the missing 2^64 bit, which is explained in figure 4.1 and equation 4.5 in this pdf file: https://gmplib.org/~tege/divcnst-pldi94.pdf Further explanation can be seen in this prior answer: Why does GCC use multiplication by a strange number in implementing integer division? If the 65 bit multiplier has a trailing 0 bit, then it can be shifted right 1 bit to result in a 64 bit multiplier, reducing the number of instructions. For example if dividing by 5: <pre class="prettyprint"><code>floor((2^(64+ceil(log2(5))))/5)+1 = floor((2^67)/5)+1 = 29514790517935282586 29514790517935282586 >> 1 = 14757395258967641293 mov rax, -3689348814741910323 ; == 14757395258967641293 == 0cccccccccccccccdH mul rcx shr rdx, 2 mov rax, rdx </code></pre>

Is an extra move somehow faster when doing division-by-multiplication?

Tags:

c

gcc

assembly

x86-64

clang

Consider this function:

unsigned long f(unsigned long x) {
    return x / 7;
}

With -O3, Clang turns the division into a multiplication, as expected:

f:                                      # @f
        movabs  rcx, 2635249153387078803
        mov     rax, rdi
        mul     rcx
        sub     rdi, rdx
        shr     rdi
        lea     rax, [rdi + rdx]
        shr     rax, 2
        ret

GCC does basically the same thing, except for using rdx where Clang uses rcx. But they both appear to be doing an extra move. Why not this instead?

f:
        movabs  rax, 2635249153387078803
        mul     rdi
        sub     rdi, rdx
        shr     rdi
        lea     rax, [rdi + rdx]
        shr     rax, 2
        ret

In particular, they both put the numerator in rax, but by putting the magic number there instead, you avoid having to move the numerator at all. If this is actually better, I'm surprised that neither GCC nor Clang do it this way, since it feels so obvious. Is there some microarchitectural reason that their way is actually faster than my way?

Godbolt link.

646

asked Apr 27 '20 23:04

Joseph Sible-Reinstate Monica

2 Answers

This very much looks like a missed optimization by both gcc and clang; no benefit to that extra mov.

If it's not already reported, GCC and LLVM both accept missed-optimization bug reports: https://bugs.llvm.org/ and https://gcc.gnu.org/bugzilla/. For GCC there's even a bug tag "missed-optimization".

Wasted mov instructions are unfortunately not rare, especially when looking at tiny functions where the input / output regs are nailed down the calling convention, not up to the register allocator. The do still happen in loops sometimes, like doing a bunch of extra work each iteration so everything is in the right places for the code that runs once after a loop. /facepalm.

Zero-latency mov (mov-elimination) helps reduce the cost of such missed optimizations (and cases where mov isn't avoidable), but it still takes a front-end uop so it's pretty much strictly worse. (Except by chance where it helps alignment of something later, but if that's the reason then a nop would have been as good).

And it takes up space in the ROB, reducing how far ahead out-of-order exec can see past a cache miss or other stall. mov is never truly free, only the execution-unit and latency part is eliminated - Can x86's MOV really be "free"? Why can't I reproduce this at all?

My total guess about compiler internals:

Probably gcc/clang's internal machinery need to learn that this division pattern is commutative and can take the input value in some other register and put the constant in RAX.

In a loop they'd want the constant in some other register so they could reuse it, but hopefully the compiler could still figure that out for cases where it's useful.

answered Oct 01 '22 09:10

Peter Cordes

Visual Studio 2015 generates the code you expected, rcx = input dividend:

        mov     rax, 2635249153387078803
        mul     rcx
        sub     rcx, rdx
        shr     rcx, 1
        lea     rax, QWORD PTR [rdx+rcx]
        shr     rax, 2

A divisor of 7 needs a 65 bit multiplier to get the proper accuracy.

floor((2^(64+ceil(log2(7))))/7)+1 = floor((2^67)/7)+1 = 21081993227096630419

Removing the most significant bit, 2^64, results in 21081993227096630419 - 2^64 = 2635249153387078803, which is the multiplier actually used in the code.

The generated code compensates for the missing 2^64 bit, which is explained in figure 4.1 and equation 4.5 in this pdf file:

https://gmplib.org/~tege/divcnst-pldi94.pdf

Further explanation can be seen in this prior answer:

Why does GCC use multiplication by a strange number in implementing integer division?

If the 65 bit multiplier has a trailing 0 bit, then it can be shifted right 1 bit to result in a 64 bit multiplier, reducing the number of instructions. For example if dividing by 5:

floor((2^(64+ceil(log2(5))))/5)+1 = floor((2^67)/5)+1 = 29514790517935282586
29514790517935282586 >> 1 = 14757395258967641293

        mov     rax, -3689348814741910323 ; == 14757395258967641293 ==  0cccccccccccccccdH
        mul     rcx
        shr     rdx, 2
        mov     rax, rdx

answered Oct 01 '22 09:10

rcgldr

Related questions
                            
                                Why does the linker generate seemingly useless relocations in .rela.plt?
                            
                                Can I read an atomic variable without atomic_load?
                            
                                How to read Import Address Table, in a driver, from a PEPROCESS?
                            
                                Unclear use of (void) with register value
                            
                                Is virtual memory used when using Port-mapped I/O?
                            
                                Why does GCC's -Wconversion behave differently for char vs. unsigned char?
                            
                                Passing list of numpy arrays to C using cython
                            
                                How to detect if memory is from the stack? (not heap or a static variable)
                            
                                extern "C" for member static callback function
                            
                                mmap Mac: Segmentation fault
                            
                                How and when should I use _POSIX_C_SOURCE in C programs?
                            
                                C file pointer changing after fork and (failed) exec
                            
                                LLVM Pass to insert an external function call to LLVM bitcode
                            
                                Strict aliasing rule in C
                            
                                How does the kernel use task_struct?
                            
                                Undefined behavior inside void expressions
                            
                                How to get only function blocks using sly
                            
                                Difference between __atomic_load_n and __atomic_load
                            
                                VSCode import "C" fails on modules
                            
                                Why does clang-tidy say vsnprintf has an uninitialized va_list argument?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With