This is my test code:
#include <chrono>
#include <iostream>
#include <cstdlib>
using namespace std;
using ll = long long;

int main()
{
    __int128_t a, b;
    ll x, y;

    a = rand() + 10000000;
    b = rand() % 50000;
    auto t0 = chrono::steady_clock::now();
    for (int i = 0; i < 100000000; i++) {
        a += b;
        a /= b;
        b *= a;
        b -= a;
        a %= b;
    }
    cout << chrono::duration_cast<chrono::milliseconds>(chrono::steady_clock::now() - t0).count()
         << ' ' << (ll)a % 100000 << '\n';

    x = rand() + 10000000;
    y = rand() % 50000;
    t0 = chrono::steady_clock::now();
    for (int i = 0; i < 100000000; i++) {
        x += y;
        x /= y;
        y *= x;
        y -= x;
        x %= y;
    }
    cout << chrono::duration_cast<chrono::milliseconds>(chrono::steady_clock::now() - t0).count()
         << ' ' << (ll)x % 100000 << '\n';

    return 0;
}
This is the test result:
$ g++ main.cpp -o main -O2
$ ./main
2432 1
2627 1
Using GCC 10.1.0 on x64 GNU/Linux, whether I compile with -O2 or without optimization, __int128_t is always a little faster than long long. int and double are both significantly faster than long long; long long has become the slowest type. How does this happen?
The performance difference comes from the efficiency of 128-bit division/modulus with GCC/Clang in this specific case.

Indeed, on my system as well as on godbolt, sizeof(long long) = 8 and sizeof(__int128_t) = 16. Thus operations on the former are performed by native instructions while operations on the latter are not (since we focus on 64-bit platforms). Additions, multiplications and subtractions are slower with __int128_t. But the built-in functions for division/modulus on 16-byte types (__divti3 and __modti3 on x86 GCC/Clang) are surprisingly faster than the native idiv instruction (which is pretty slow, at least on Intel processors).
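A quick way to see this for yourself (a minimal sketch, not taken from the benchmark) is to compile these two functions with g++ -O2 -S or on Godbolt: the 64-bit division inlines a native cqo/idiv pair, while the 128-bit one becomes a call to the libgcc helper __divti3.

#include <cstdint>

int64_t div64(int64_t a, int64_t b) {
    return a / b;        // compiles to: cqo + idiv (native instruction)
}

__int128_t div128(__int128_t a, __int128_t b) {
    return a / b;        // compiles to: call __divti3 (libgcc helper)
}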
If we look deeper into the implementation of the GCC/Clang built-in functions (used only for __int128_t here), we can see that __modti3 uses conditionals (when calling __udivmodti4). Intel processors can execute the code faster because:

- the div instruction is still used in most possible paths (especially in this case);
- the latency of the div/idiv instructions covers most of the overall execution time because of their very high latencies. The div/idiv instructions cannot be executed in parallel because of the loop dependencies. However, the latency of a div is lower than that of an idiv, making the former faster.

Please note that the performance of the two implementations can differ greatly from one architecture to another (because of the number of CPU ports, the branch prediction capability and the latency/throughput of the idiv instruction). Indeed, the latency of a 64-bit idiv instruction takes 41-95 cycles on Skylake while it takes 8-41 cycles on AMD Ryzen processors, for example. Respectively, the latency of a div is about 6-89 cycles on Skylake and still the same on Ryzen. This means that the benchmark performance results should be significantly different on Ryzen processors (the opposite effect may be seen due to the additional instruction/branch costs in the 128-bit case).
TL:DR: __int128 division helper functions internally end up doing an unsigned div reg64 (after some branching on the values being positive and the upper halves being 0). 64-bit div is faster on Intel CPUs than the signed idiv reg64 that GCC inlines for signed long long. Faster by enough to make up for all the extra overhead of the helper function, and for the extended precision of the other operations.

You probably wouldn't see this effect on AMD CPUs: long long would be faster as expected because idiv r64 is similar enough in perf to div r64 there.
And unsigned long long is faster than unsigned __int128 even on Intel CPUs, for example on my i7-6700k (Skylake) at 3.9GHz (run under perf stat to be sure of the CPU frequency during the test). With both types unsigned, both versions end up using div, so there's no div vs. idiv advantage left to pay for the helper function's extra overhead.

Also, drawing any general conclusions from a very specific micro-benchmark like this would be a bad idea. It is interesting to dig into why exactly the extended-precision __int128 type manages to be faster in this division benchmark with positive numbers small enough to fit in a 32-bit integer, though.
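If you want to check the unsigned case yourself, a minimal variation of the question's benchmark (a sketch, assuming the starting values stay positive, as they do here) just swaps in the unsigned types and times both loops on the same inputs:

#include <chrono>
#include <cstdlib>
#include <iostream>
using namespace std;

int main() {
    unsigned long long x = rand() + 10000000, y = rand() % 50000;
    unsigned __int128 a = x, b = y;   // same starting values for both runs

    auto t0 = chrono::steady_clock::now();
    for (int i = 0; i < 100000000; i++) { a += b; a /= b; b *= a; b -= a; a %= b; }
    cout << chrono::duration_cast<chrono::milliseconds>(chrono::steady_clock::now() - t0).count()
         << " ms for unsigned __int128\n";

    t0 = chrono::steady_clock::now();
    for (int i = 0; i < 100000000; i++) { x += y; x /= y; y *= x; y -= x; x %= y; }
    cout << chrono::duration_cast<chrono::milliseconds>(chrono::steady_clock::now() - t0).count()
         << " ms for unsigned long long\n";
}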
Your benchmark is heavily weighted towards division, which you do twice per iteration (/ and %), even though it's much more expensive than other operations and in most code is used much less often. (e.g. sum a whole array then divide once to get the average.)
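For instance, a more typical pattern keeps the division out of the hot loop entirely; a small sketch (hypothetical helper, not from the question's code):

#include <cstddef>
#include <cstdint>

// Sum the whole array, then divide once: the cost of the single
// division is amortized over n cheap, vectorizable additions.
int64_t average(const int64_t* v, size_t n) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return n ? sum / (int64_t)n : 0;
}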
Your benchmark also has no instruction-level parallelism: each step has a data dependency on the previous step. This prevents auto-vectorization or anything that would show some of the advantages of narrower types.
(It's also not careful to avoid warm-up effects, like the first timed region being slow until the CPU gets up to max turbo; see Idiomatic way of performance evaluation? But that happens much faster than the couple of seconds of your timed regions, so it's not a problem here.)
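One simple precaution against that (a sketch of the general idea, not the linked answer's exact recipe) is to spin untimed for a moment so the CPU reaches its steady clock speed before the measured regions start:

#include <chrono>

static void warm_up_cpu() {
    volatile unsigned long long sink = 0;
    auto start = std::chrono::steady_clock::now();
    // Busy-wait ~200 ms so turbo / frequency scaling settles before timing.
    while (std::chrono::steady_clock::now() - start < std::chrono::milliseconds(200))
        sink += 1;
}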
128-bit integer division (especially signed) is too complicated for GCC to want to inline, so GCC emits a call to a helper function, __divti3 or __modti3. (TI = tetra-integer, GCC's internal name for an integer that's 4x the size of int.) These functions are documented in the GCC internals manual.
You can see the compiler-generated asm on the Godbolt compiler explorer, i.e. 128-bit addition with add/adc, multiplication with one mul full-multiply of the low halves, and 2x non-widening imul of the cross products. Yes, these are slower than the single-instruction equivalents for int64_t.
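As a rough C++ sketch of that multiply decomposition (a hand-written illustration, not GCC's or libgcc's actual code):

#include <cstdint>

struct u128 { uint64_t lo, hi; };

// (a.hi:a.lo) * (b.hi:b.lo), keeping the low 128 bits of the result.
u128 mul128(u128 a, u128 b) {
    unsigned __int128 low = (unsigned __int128)a.lo * b.lo;  // the one widening mul of the low halves
    uint64_t cross = a.lo * b.hi + a.hi * b.lo;              // the 2x non-widening imul cross products
    return { (uint64_t)low, (uint64_t)(low >> 64) + cross };
}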
But Godbolt doesn't show you the asm for libgcc helper functions. It doesn't disassemble them even in "compile-to-binary" and disassemble mode (instead of the usual compiler asm text output) because it dynamically links libgcc_s instead of libgcc.a.
Extended-precision signed division is done by negating if necessary and doing unsigned division of 64-bit chunks, then fixing up the sign of the result if necessary.
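In C++ terms the strategy is roughly the following (a simplified sketch of the idea, not libgcc's actual source):

__int128_t signed_div_sketch(__int128_t a, __int128_t b) {
    bool negate = (a < 0) != (b < 0);                    // quotient is negative iff the signs differ
    unsigned __int128 ua = a < 0 ? -(unsigned __int128)a : (unsigned __int128)a;
    unsigned __int128 ub = b < 0 ? -(unsigned __int128)b : (unsigned __int128)b;
    unsigned __int128 q = ua / ub;                       // unsigned division on 64-bit chunks
    return negate ? -(__int128_t)q : (__int128_t)q;      // fix up the sign of the result
}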
With both inputs small and positive, no actual negation is needed (just testing and branching). There are also fast paths for small numbers (high-half divisor = 0, and the quotient will fit in 64 bits), which is the case here. The end result is that the path of execution through __divti3 looks like this:

This is from manually single-stepping into the call to __divti3 with gdb, after compiling with g++ -g -O3 int128-bench.cpp -o int128-bench.O3 on my Arch GNU/Linux system, with gcc-libs 10.1.0-2.
# Inputs: dividend = RSI:RDI, divisor = RCX:RDX
# returns signed quotient RDX:RAX
| >0x7ffff7c4fd40 <__divti3>       endbr64             # in case caller was using CFE (control-flow enforcement), apparently this instruction has to pollute all library functions now.  I assume it's cheap at least in the no-CFE case.
│  0x7ffff7c4fd44 <__divti3+4>     push   r12
│  0x7ffff7c4fd46 <__divti3+6>     mov    r11,rdi
│  0x7ffff7c4fd49 <__divti3+9>     mov    rax,rdx
│  0x7ffff7c4fd4c <__divti3+12>    xor    edi,edi
│  0x7ffff7c4fd4e <__divti3+14>    push   rbx
│  0x7ffff7c4fd4f <__divti3+15>    mov    rdx,rcx
│  0x7ffff7c4fd52 <__divti3+18>    test   rsi,rsi      # check sign bit of dividend (and jump over a negation)
│  0x7ffff7c4fd55 <__divti3+21>    jns    0x7ffff7c4fd6e <__divti3+46>
... taken branch to
| >0x7ffff7c4fd6e <__divti3+46>    mov    r10,rdx
│  0x7ffff7c4fd71 <__divti3+49>    test   rdx,rdx      # check sign bit of divisor (and jump over a negation), note there was a mov rdx,rcx earlier
│  0x7ffff7c4fd74 <__divti3+52>    jns    0x7ffff7c4fd86 <__divti3+70>
... taken branch to
│ >0x7ffff7c4fd86 <__divti3+70>    mov    r9,rax
│  0x7ffff7c4fd89 <__divti3+73>    mov    r8,r11
│  0x7ffff7c4fd8c <__divti3+76>    test   r10,r10      # check high half of abs(divisor) for being non-zero
│  0x7ffff7c4fd8f <__divti3+79>    jne    0x7ffff7c4fdb0 <__divti3+112>     # falls through: small-number fast path
│  0x7ffff7c4fd91 <__divti3+81>    cmp    rax,rsi      # check that quotient will fit in 64 bits so 128b/64b single div won't fault: jump if (divisor <= high half of dividend)
│  0x7ffff7c4fd94 <__divti3+84>    jbe    0x7ffff7c4fe00 <__divti3+192>     # falls through: small-number fast path
│  0x7ffff7c4fd96 <__divti3+86>    mov    rdx,rsi
│  0x7ffff7c4fd99 <__divti3+89>    mov    rax,r11
│  0x7ffff7c4fd9c <__divti3+92>    xor    esi,esi
│ >0x7ffff7c4fd9e <__divti3+94>    div    r9           #### Do the actual division ###
│  0x7ffff7c4fda1 <__divti3+97>    mov    rcx,rax
│  0x7ffff7c4fda4 <__divti3+100>   jmp    0x7ffff7c4fdb9 <__divti3+121>
... taken branch to
│ >0x7ffff7c4fdb9 <__divti3+121>   mov    rax,rcx
│  0x7ffff7c4fdbc <__divti3+124>   mov    rdx,rsi
│  0x7ffff7c4fdbf <__divti3+127>   test   rdi,rdi      # check if the result should be negative
│  0x7ffff7c4fdc2 <__divti3+130>   je     0x7ffff7c4fdce <__divti3+142>
... taken branch over a neg rax / adc rax,0 / neg rdx
│ >0x7ffff7c4fdce <__divti3+142>   pop    rbx
│  0x7ffff7c4fdcf <__divti3+143>   pop    r12
│  0x7ffff7c4fdd1 <__divti3+145>   ret
... return back to the loop body that called it
Intel CPUs (since IvyBridge) have zero-latency mov, so all that overhead doesn't worsen the critical path latency (which is your bottleneck) significantly. Or at least not enough to make up the difference between idiv and div.
The branching is handled by branch prediction and speculative execution, only checking the predictions after the fact once the actual input register values are available. The branching goes the same way every time, so it's trivial for branch prediction to learn. Since division is so slow, there's plenty of time for out-of-order exec to catch up.
64-bit operand-size integer division is very slow on Intel CPUs, even when the numbers are actually small and would fit in a 32-bit integer, and the extra microcode for signed integer division is even more expensive.
e.g. on my Skylake (i7-6700k), https://uops.info/ shows (table search result) that:

- idiv r64 is 56 uops for the front-end, with latency from 41 to 95 cycles (from divisor to quotient, which is the relevant case here I think).
- div r64 is 33 uops for the front-end, with latency from 35 to 87 cycles (for that same latency path).

The latency best-case happens for small quotients or small dividends or something, I can never remember which.
Similar to the branching that GCC does in software for 128-bit division in terms of 64-bit, I think the CPU microcode is internally doing 64-bit division in terms of narrower operations, probably the 32-bit that's only 10 uops for signed or unsigned, with much lower latency. (Ice Lake improves the divider so 64-bit division isn't much slower than 32-bit.)
This is why you found long long so much slower than int for this benchmark. In a lot of cases it's about the same speed, or half speed if memory bandwidth or SIMD are involved (only 2 elements per 128 bits of vector width, not 4).
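One common way to exploit the cheaper 32-bit divider (a sketch of the trick; the related link at the end describes the same idea at the asm level) is to do the division at 32-bit operand size whenever the values are known to fit:

#include <cstdint>

uint64_t div_small(uint64_t a, uint64_t b) {
    if ((a | b) <= UINT32_MAX)                 // both values fit in 32 bits
        return (uint32_t)a / (uint32_t)b;      // 32-bit div: much cheaper on Intel before Ice Lake
    return a / b;                              // otherwise fall back to the 64-bit div
}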
AMD CPUs handle 64-bit operand size more efficiently, with performance only depending on the actual values, so about the same for div r32 vs. div r64 with the same numbers.
BTW, the actual values tend to be something like a=1814246614 / b=1814246613 = 1, then a=1 % b=1814246612 (with b decreasing by 1 every iteration). Only ever testing division with quotient=1 seems very silly. (The first iteration might be different, but we get into this state for the 2nd and later.)
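If you want to see that for yourself, a tiny trace (an add-on for illustration, not part of the original benchmark) prints the values the loop actually divides:

#include <cstdio>
#include <cstdlib>

int main() {
    __int128_t a = rand() + 10000000, b = rand() % 50000;
    for (int i = 0; i < 5; i++) {
        a += b; a /= b; b *= a; b -= a; a %= b;
        // the values stay well inside 64 bits, so casting for printf is fine
        printf("i=%d  a=%lld  b=%lld\n", i, (long long)a, (long long)b);
    }
}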
Performance of integer operations other than division is not data-dependent on modern CPUs. (Unless of course there are compile-time constants that allow different asm to be emitted. Like division by a constant is much cheaper when done with a multiplicative inverse calculated at compile time.)
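For example (a standard illustration, not specific to this benchmark), dividing by a compile-time constant avoids div/idiv entirely:

#include <cstdint>

uint64_t div_by_10(uint64_t x) {
    // GCC/Clang turn this into a multiply by a precomputed "magic" constant
    // plus shifts -- no div instruction at all.
    return x / 10;
}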
Re: double: see Floating point division vs floating point multiplication for division vs. multiplication. FP division is often harder to avoid, and its performance is relevant in more cases, so it's handled better.
Related: changing div r64 to div r32 in a program that uses small-enough numbers, and seeing throughput improve ~3x.