Why does GCC use multiplication by a strange number in implementing integer division?

File division.c

#include <stdlib.h>
#include <stdio.h>

int main()
{
    size_t i = 9;
    size_t j = i / 5;
    printf("%zu\n",j);
    return 0;
}

And then generating assembly language code with:

gcc -S division.c -O0 -masm=intel

But looking at generated division.s file, it doesn't contain any div operations! Instead, it does some kind of black magic with bit shifting and magic numbers. Here's a code snippet that computes i/5:

mov     rax, QWORD PTR [rbp-16]   ; Move i (=9) to RAX
movabs  rdx, -3689348814741910323 ; Move some magic number to RDX (?)
mul     rdx                       ; Multiply 9 by magic number
mov     rax, rdx                  ; Take only the upper 64 bits of the result
shr     rax, 2                    ; Shift these bits 2 places to the right (?)
mov     QWORD PTR [rbp-8], rax    ; Magically, RAX contains 9/5=1 now, 
                                  ; so we can assign it to j

What's going on here? Why doesn't GCC use div at all? How does it generate this magic number and why does everything work?

293

asked Dec 16 '16 11:12

qiubit

4 Answers

Integer division is one of the slowest arithmetic operations you can perform on a modern processor, with latency up to the dozens of cycles and bad throughput. (For x86, see Agner Fog's instruction tables and microarch guide).

If you know the divisor ahead of time, you can avoid the division by replacing it with a set of other operations (multiplications, additions, and shifts) which have the equivalent effect. Even if several operations are needed, it's often still a heck of a lot faster than the integer division itself.

Implementing the C / operator this way instead of with a multi-instruction sequence involving div is just GCC's default way of doing division by constants. It doesn't require optimizing across operations and doesn't change anything even for debugging. (Using -Os for small code size does get GCC to use div, though.) Using a multiplicative inverse instead of division is like using lea instead of mul and add

As a result, you only tend to see div or idiv in the output if the divisor isn't known at compile-time.

For information on how the compiler generates these sequences, as well as code to let you generate them for yourself (almost certainly unnecessary unless you're working with a braindead compiler), see libdivide.

100

answered Oct 06 '22 06:10

Sneftel

Dividing by 5 is the same as multiplying 1/5, which is again the same as multiplying by 4/5 and shifting right 2 bits. The value concerned is CCCCCCCCCCCCCCCD in hex, which is the binary representation of 4/5 if put after a hexadecimal point (i.e. the binary for four fifths is 0.110011001100 recurring - see below for why). I think you can take it from here! You might want to check out fixed point arithmetic (though note it's rounded to an integer at the end).

As to why, multiplication is faster than division, and when the divisor is fixed, this is a faster route.

See Reciprocal Multiplication, a tutorial for a detailed writeup about how it works, explaining in terms of fixed-point. It shows how the algorithm for finding the reciprocal works, and how to handle signed division and modulo.

Let's consider for a minute why 0.CCCCCCCC... (hex) or 0.110011001100... binary is 4/5. Divide the binary representation by 4 (shift right 2 places), and we'll get 0.001100110011... which by trivial inspection can be added the original to get 0.111111111111..., which is obviously equal to 1, the same way 0.9999999... in decimal is equal to one. Therefore, we know that x + x/4 = 1, so 5x/4 = 1, x=4/5. This is then represented as CCCCCCCCCCCCD in hex for rounding (as the binary digit beyond the last one present would be a 1).

answered Oct 06 '22 06:10

abligh

In general multiplication is much faster than division. So if we can get away with multiplying by the reciprocal instead we can significantly speed up division by a constant

A wrinkle is that we cannot represent the reciprocal exactly (unless the division was by a power of two but in that case we can usually just convert the division to a bit shift). So to ensure correct answers we have to be careful that the error in our reciprocal does not cause errors in our final result.

-3689348814741910323 is 0xCCCCCCCCCCCCCCCD which is a value of just over 4/5 expressed in 0.64 fixed point.

When we multiply a 64 bit integer by a 0.64 fixed point number we get a 64.64 result. We truncate the value to a 64-bit integer (effectively rounding it towards zero) and then perform a further shift which divides by four and again truncates By looking at the bit level it is clear that we can treat both truncations as a single truncation.

This clearly gives us at least an approximation of division by 5 but does it give us an exact answer correctly rounded towards zero?

To get an exact answer the error needs to be small enough not to push the answer over a rounding boundary.

The exact answer to a division by 5 will always have a fractional part of 0, 1/5, 2/5, 3/5 or 4/5 . Therefore a positive error of less than 1/5 in the multiplied and shifted result will never push the result over a rounding boundary.

The error in our constant is (1/5) * 2^-64. The value of i is less than 2⁶⁴ so the error after multiplying is less than 1/5. After the division by 4 the error is less than (1/5) * 2⁻².

(1/5) * 2⁻² < 1/5 so the answer will always be equal to doing an exact division and rounding towards zero.

Unfortunately this doesn't work for all divisors.

If we try to represent 4/7 as a 0.64 fixed point number with rounding away from zero we end up with an error of (6/7) * 2^-64. After multiplying by an i value of just under 2⁶⁴ we end up with an error just under 6/7 and after dividing by four we end up with an error of just under 1.5/7 which is greater than 1/7.

So to implement divison by 7 correctly we need to multiply by a 0.65 fixed point number. We can implement that by multiplying by the lower 64 bits of our fixed point number, then adding the original number (this may overflow into the carry bit) then doing a rotate through carry.

answered Oct 06 '22 07:10

plugwash

Here is link to a document of an algorithm that produces the values and code I see with Visual Studio (in most cases) and that I assume is still used in GCC for division of a variable integer by a constant integer.

http://gmplib.org/~tege/divcnst-pldi94.pdf

In the article, a uword has N bits, a udword has 2N bits, n = numerator = dividend, d = denominator = divisor, ℓ is initially set to ceil(log2(d)), shpre is pre-shift (used before multiply) = e = number of trailing zero bits in d, shpost is post-shift (used after multiply), prec is precision = N - e = N - shpre. The goal is to optimize calculation of n/d using a pre-shift, multiply, and post-shift.

Scroll down to figure 6.2, which defines how a udword multiplier (max size is N+1 bits), is generated, but doesn't clearly explain the process. I'll explain this below.

Figure 4.2 and figure 6.2 show how the multiplier can be reduced to a N bit or less multiplier for most divisors. Equation 4.5 explains how the formula used to deal with N+1 bit multipliers in figure 4.1 and 4.2 was derived.

In the case of modern X86 and other processors, multiply time is fixed, so pre-shift doesn't help on these processors, but it still helps to reduce the multiplier from N+1 bits to N bits. I don't know if GCC or Visual Studio have eliminated pre-shift for X86 targets.

Going back to Figure 6.2. The numerator (dividend) for mlow and mhigh can be larger than a udword only when denominator (divisor) > 2^(N-1) (when ℓ == N => mlow = 2^(2N)), in this case the optimized replacement for n/d is a compare (if n>=d, q = 1, else q = 0), so no multiplier is generated. The initial values of mlow and mhigh will be N+1 bits, and two udword/uword divides can be used to produce each N+1 bit value (mlow or mhigh). Using X86 in 64 bit mode as an example:

; upper 8 bytes of dividend = 2^(ℓ) = (upper part of 2^(N+ℓ))
; lower 8 bytes of dividend for mlow  = 0
; lower 8 bytes of dividend for mhigh = 2^(N+ℓ-prec) = 2^(ℓ+shpre) = 2^(ℓ+e)
dividend  dq    2 dup(?)        ;16 byte dividend
divisor   dq    1 dup(?)        ; 8 byte divisor

; ...
        mov     rcx,divisor
        mov     rdx,0
        mov     rax,dividend+8     ;upper 8 bytes of dividend
        div     rcx                ;after div, rax == 1
        mov     rax,dividend       ;lower 8 bytes of dividend
        div     rcx
        mov     rdx,1              ;rdx:rax = N+1 bit value = 65 bit value

You can test this with GCC. You're already seen how j = i/5 is handled. Take a look at how j = i/7 is handled (which should be the N+1 bit multiplier case).

On most current processors, multiply has a fixed timing, so a pre-shift is not needed. For X86, the end result is a two instruction sequence for most divisors, and a five instruction sequence for divisors like 7 (in order to emulate a N+1 bit multiplier as shown in equation 4.5 and figure 4.2 of the pdf file). Example X86-64 code:

;       rbx = dividend, rax = 64 bit (or less) multiplier, rcx = post shift count
;       two instruction sequence for most divisors:

        mul     rbx                     ;rdx = upper 64 bits of product
        shr     rdx,cl                  ;rdx = quotient
;
;       five instruction sequence for divisors like 7
;       to emulate 65 bit multiplier (rbx = lower 64 bits of multiplier)

        mul     rbx                     ;rdx = upper 64 bits of product
        sub     rbx,rdx                 ;rbx -= rdx
        shr     rbx,1                   ;rbx >>= 1
        add     rdx,rbx                 ;rdx = upper 64 bits of corrected product
        shr     rdx,cl                  ;rdx = quotient
;       ...

To explain the 5 instruction sequence, a simple 3 instruction sequence could overflow. Let u64() mean upper 64 bits (all that is needed for quotient)

        mul     rbx                     ;rdx = u64(dvnd*mplr)
        add     rdx,rbx                 ;rdx = u64(dvnd*(2^64 + mplr)), could overflow
        shr     rdx,cl

To handle this case, cl = post_shift-1. rax = multiplier - 2^64, rbx = dividend. u64() is upper 64 bits. Note that rax = rax<<1 - rax. Quotient is:

        u64( (  rbx * (2^64 + rax) )>>(cl+1) )
        u64( (  rbx * (2^64 + rax<<1 - rax) )>>(cl+1) )
        u64( (  (rbx * 2^64) + (rbx * rax)<<1 - (rbx * rax) )>>(cl+1) )
        u64( (  (rbx * 2^64) - (rbx * rax) + (rbx * rax)<<1 )>>(cl+1) )
        u64( ( ((rbx * 2^64) - (rbx * rax))>>1) + (rbx*rax) )>>(cl  ) )

        mul     rbx                     ;   (rbx*rax)
        sub     rbx,rdx                 ;   (rbx*2^64)-(rbx*rax)
        shr     rbx,1                   ;(  (rbx*2^64)-(rbx*rax))>>1
        add     rdx,rbx                 ;( ((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax)
        shr     rdx,cl                  ;((((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax))>>cl

answered Oct 06 '22 06:10

rcgldr

Related questions
                            
                                Is an array name a pointer?
                            
                                C read file line by line
                            
                                What is the Difference Between read() and recv() , and Between send() and write()?
                            
                                printf() formatting for hexadecimal
                            
                                Interview question: Check if one string is a rotation of other string [closed]
                            
                                Why do function pointer definitions work with any number of ampersands '&' or asterisks '*'?
                            
                                How do you compare structs for equality in C?
                            
                                typedef fixed length array
                            
                                How do I restrict a float value to only two places after the decimal point in C?
                            
                                Why is unsigned integer overflow defined behavior but signed integer overflow isn't?
                            
                                Are memory leaks ever ok? [closed]
                            
                                static const vs #define
                            
                                Is the sizeof(some pointer) always equal to four?
                            
                                Is it better to use C void arguments "void foo(void)" or not "void foo()"? [duplicate]
                            
                                Single quotes vs. double quotes in C or C++
                            
                                #define macro for debug printing in C?
                            
                                Why does this for loop exit on some platforms and not on others?
                            
                                How can I suppress "unused parameter" warnings in C?
                            
                                Execution time of C program
                            
                                What is Bit Masking?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Why does GCC use multiplication by a strange number in implementing integer division?

Tags:

c

gcc

assembly

integer-division

x86-64

intel

File division.c

qiubit

People also ask

4 Answers

Sneftel

abligh

plugwash

rcgldr

Recent Activity

Donate For Us