I have a 128-bit unsigned integer A and a 64-bit unsigned integer B. What's the fastest way to calculate A % B
- that is the (64-bit) remainder from dividing A by B?
I'm looking to do this in either C or assembly language, but I need to target the 32-bit x86 platform. This unfortunately means that I cannot take advantage of compiler support for 128-bit integers, nor of the x64 architecture's ability to perform the required operation in a single instruction.
Edit:
Thank you for the answers so far. However, it appears to me that the suggested algorithms would be quite slow - wouldn't the fastest way to perform a 128-bit by 64-bit division be to leverage the processor's native support for 64-bit by 32-bit division? Does anyone know if there is a way to perform the larger division in terms of a few smaller divisions?
Re: How often does B change?
Primarily I'm interested in a general solution - what calculation would you perform if A and B are likely to be different every time?
However, a second possible situation is that B does not vary as often as A - there may be as many as 200 As to divide by each B. How would your answer differ in this case?
Unlike the decimal number system where we use the digits 0 to 9 to represent a number, in a binary system, we use only 2 digits that are 0 and 1 (bits). We have used 8 bits to represent 128 in binary.
Software. In the same way that compilers emulate e.g. 64-bit integer arithmetic on architectures with register sizes less than 64 bits, some compilers also support 128-bit integer arithmetic. For example, the GCC C compiler 4.6 and later has a 128-bit integer type __int128 for some architectures.
The 128-bit data type can handle up to 31 significant digits (compared to 17 handled by the 64-bit long double). However, while this data type can store numbers with more precision than the 64-bit data type, it does not store numbers of greater magnitude.
DrinkMoreBoilingWater's blog. As an extension the integer scalar type __int128 is supported for targets which have an integer mode wide enough to hold 128 bits. Simply write __int128 for a signed 128-bit integer, or unsigned __int128 for an unsigned 128-bit integer.
You can use the division version of Russian Peasant Multiplication.
To find the remainder, execute (in pseudo-code):
X = B;
while (X <= A/2)
{
X <<= 1;
}
while (A >= B)
{
if (A >= X)
A -= X;
X >>= 1;
}
The modulus is left in A.
You'll need to implement the shifts, comparisons and subtractions to operate on values made up of a pair of 64 bit numbers, but that's fairly trivial (likely you should implement the left-shift-by-1 as X + X
).
This will loop at most 255 times (with a 128 bit A). Of course you need to do a pre-check for a zero divisor.
Perhaps you're looking for a finished program, but the basic algorithms for multi-precision arithmetic can be found in Knuth's Art of Computer Programming, Volume 2. You can find the division algorithm described online here. The algorithms deal with arbitrary multi-precision arithmetic, and so are more general than you need, but you should be able to simplify them for 128 bit arithmetic done on 64- or 32-bit digits. Be prepared for a reasonable amount of work (a) understanding the algorithm, and (b) converting it to C or assembler.
You might also want to check out Hacker's Delight, which is full of very clever assembler and other low-level hackery, including some multi-precision arithmetic.
If your B is small enough for the uint64_t
+
operation to not wrap:
Given A = AH*2^64 + AL
:
A % B == (((AH % B) * (2^64 % B)) + (AL % B)) % B
== (((AH % B) * ((2^64 - B) % B)) + (AL % B)) % B
If your compiler supports 64-bit integers, then this is probably the easiest way to go.
MSVC's implementation of a 64-bit modulo on 32-bit x86 is some hairy loop filled assembly (VC\crt\src\intel\llrem.asm
for the brave), so I'd personally go with that.
This is almost untested partly speed modificated Mod128by64 'Russian peasant' algorithm function. Unfortunately I'm a Delphi user so this function works under Delphi. :) But the assembler is almost the same so...
function Mod128by64(Dividend: PUInt128; Divisor: PUInt64): UInt64;
//In : eax = @Dividend
// : edx = @Divisor
//Out: eax:edx as Remainder
asm
//Registers inside rutine
//Divisor = edx:ebp
//Dividend = bh:ebx:edx //We need 64 bits + 1 bit in bh
//Result = esi:edi
//ecx = Loop counter and Dividend index
push ebx //Store registers to stack
push esi
push edi
push ebp
mov ebp, [edx] //Divisor = edx:ebp
mov edx, [edx + 4]
mov ecx, ebp //Div by 0 test
or ecx, edx
jz @DivByZero
xor edi, edi //Clear result
xor esi, esi
//Start of 64 bit division Loop
mov ecx, 15 //Load byte loop shift counter and Dividend index
@SkipShift8Bits: //Small Dividend numbers shift optimisation
cmp [eax + ecx], ch //Zero test
jnz @EndSkipShiftDividend
loop @SkipShift8Bits //Skip 8 bit loop
@EndSkipShiftDividend:
test edx, $FF000000 //Huge Divisor Numbers Shift Optimisation
jz @Shift8Bits //This Divisor is > $00FFFFFF:FFFFFFFF
mov ecx, 8 //Load byte shift counter
mov esi, [eax + 12] //Do fast 56 bit (7 bytes) shift...
shr esi, cl //esi = $00XXXXXX
mov edi, [eax + 9] //Load for one byte right shifted 32 bit value
@Shift8Bits:
mov bl, [eax + ecx] //Load 8 bits of Dividend
//Here we can unrole partial loop 8 bit division to increase execution speed...
mov ch, 8 //Set partial byte counter value
@Do65BitsShift:
shl bl, 1 //Shift dividend left for one bit
rcl edi, 1
rcl esi, 1
setc bh //Save 65th bit
sub edi, ebp //Compare dividend and divisor
sbb esi, edx //Subtract the divisor
sbb bh, 0 //Use 65th bit in bh
jnc @NoCarryAtCmp //Test...
add edi, ebp //Return privius dividend state
adc esi, edx
@NoCarryAtCmp:
dec ch //Decrement counter
jnz @Do65BitsShift
//End of 8 bit (byte) partial division loop
dec cl //Decrement byte loop shift counter
jns @Shift8Bits //Last jump at cl = 0!!!
//End of 64 bit division loop
mov eax, edi //Load result to eax:edx
mov edx, esi
@RestoreRegisters:
pop ebp //Restore Registers
pop edi
pop esi
pop ebx
ret
@DivByZero:
xor eax, eax //Here you can raise Div by 0 exception, now function only return 0.
xor edx, edx
jmp @RestoreRegisters
end;
At least one more speed optimisation is possible! After 'Huge Divisor Numbers Shift Optimisation' we can test divisors high bit, if it is 0 we do not need to use extra bh register as 65th bit to store in it. So unrolled part of loop can look like:
shl bl,1 //Shift dividend left for one bit
rcl edi,1
rcl esi,1
sub edi, ebp //Compare dividend and divisor
sbb esi, edx //Subtract the divisor
jnc @NoCarryAtCmpX
add edi, ebp //Return privius dividend state
adc esi, edx
@NoCarryAtCmpX:
I know the question specified 32-bit code, but the answer for 64-bit may be useful or interesting to others.
And yes, 64b/32b => 32b division does make a useful building-block for 128b % 64b => 64b. libgcc's __umoddi3
(source linked below) gives an idea of how to do that sort of thing, but it only implements 2N % 2N => 2N on top of a 2N / N => N division, not 4N % 2N => 2N.
Wider multi-precision libraries are available, e.g. https://gmplib.org/manual/Integer-Division.html#Integer-Division.
GNU C on 64-bit machines does provide an __int128
type, and libgcc functions to multiply and divide as efficiently as possible on the target architecture.
x86-64's div r/m64
instruction does 128b/64b => 64b division (also producing remainder as a second output), but it faults if the quotient overflows. So you can't directly use it if A/B > 2^64-1
, but you can get gcc to use it for you (or even inline the same code that libgcc uses).
This compiles (Godbolt compiler explorer) to one or two div
instructions (which happen inside a libgcc function call). If there was a faster way, libgcc would probably use that instead.
#include <stdint.h>
uint64_t AmodB(unsigned __int128 A, uint64_t B) {
return A % B;
}
The __umodti3
function it calls calculates a full 128b/128b modulo, but the implementation of that function does check for the special case where the divisor's high half is 0, as you can see in the libgcc source. (libgcc builds the si/di/ti version of the function from that code, as appropriate for the target architecture. udiv_qrnnd
is an inline asm macro that does unsigned 2N/N => N division for the target architecture.
For x86-64 (and other architectures with a hardware divide instruction), the fast-path (when high_half(A) < B
; guaranteeing div
won't fault) is just two not-taken branches, some fluff for out-of-order CPUs to chew through, and a single div r64
instruction, which takes about 50-100 cycles1 on modern x86 CPUs, according to Agner Fog's insn tables. Some other work can be happening in parallel with div
, but the integer divide unit is not very pipelined and div
decodes to a lot of uops (unlike FP division).
The fallback path still only uses two 64-bit div
instructions for the case where B
is only 64-bit, but A/B
doesn't fit in 64 bits so A/B
directly would fault.
Note that libgcc's __umodti3
just inlines __udivmoddi4
into a wrapper that only returns the remainder.
Footnote 1: 32-bit div
is over 2x faster on Intel CPUs. On AMD CPUs, performance only depends on the size of the actual input values, even if they're small values in a 64-bit register. If small values are common, it might be worth benchmarking a branch to a simple 32-bit division version before doing 64-bit or 128-bit division.
B
It might be worth considering calculating a fixed-point multiplicative inverse for B
, if one exists. For example, with compile-time constants, gcc does the optimization for types narrower than 128b.
uint64_t modulo_by_constant64(uint64_t A) { return A % 0x12345678ABULL; }
movabs rdx, -2233785418547900415
mov rax, rdi
mul rdx
mov rax, rdx # wasted instruction, could have kept using RDX.
movabs rdx, 78187493547
shr rax, 36 # division result
imul rax, rdx # multiply and subtract to get the modulo
sub rdi, rax
mov rax, rdi
ret
x86's mul r64
instruction does 64b*64b => 128b (rdx:rax) multiplication, and can be used as a building block to construct a 128b * 128b => 256b multiply to implement the same algorithm. Since we only need the high half of the full 256b result, that saves a few multiplies.
Modern Intel CPUs have very high performance mul
: 3c latency, one per clock throughput. However, the exact combination of shifts and adds required varies with the constant, so the general case of calculating a multiplicative inverse at run-time isn't quite as efficient each time its used as a JIT-compiled or statically-compiled version (even on top of the pre-computation overhead).
IDK where the break-even point would be. For JIT-compiling, it will be higher than ~200 reuses, unless you cache generated code for commonly-used B
values. For the "normal" way, it might possibly be in the range of 200 reuses, but IDK how expensive it would be to find a modular multiplicative inverse for 128-bit / 64-bit division.
libdivide can do this for you, but only for 32 and 64-bit types. Still, it's probably a good starting point.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With