For fun, I'm writing a bignum library in Rust. My goal (as with most bignum libraries) is to make it as efficient as I can. I'd like it to be efficient even on unusual architectures.
It seems intuitive to me that a CPU will perform arithmetic faster on integers with the native number of bits for the architecture (i.e., u64 for 64-bit machines, u16 for 16-bit machines, etc.). As such, since I want to create a library that is efficient on all architectures, I need to take the target architecture's native integer size into account. The obvious way to do this would be to use the cfg attribute target_pointer_width. For instance, to define the smallest type that will always be able to hold more than the maximum native int size:
#[cfg(target_pointer_width = "16")]
type LargeInt = u32;
#[cfg(target_pointer_width = "32")]
type LargeInt = u64;
#[cfg(target_pointer_width = "64")]
type LargeInt = u128;
However, while looking into this, I came across this comment. It gives an example of an architecture where the native int size is different from the pointer width. Thus, my solution will not work for all architectures.

Another potential solution would be to write a build script which codegens a small module defining LargeInt based on the size of a usize (which we can acquire like so: std::mem::size_of::<usize>()). However, this has the same problem as above, since usize is based on the pointer width as well.
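To illustrate, a rough sketch of what such a build script could look like (a sketch only; the file and type names are placeholders). Note that calling std::mem::size_of::<usize>() inside build.rs would actually measure the host's pointer width, not the target's, so the sketch reads the CARGO_CFG_TARGET_POINTER_WIDTH variable that Cargo sets for build scripts instead:

// build.rs (sketch)
use std::{env, fs, path::Path};

fn main() {
    // Cargo exposes the *target's* cfg values to build scripts as env vars.
    let width = env::var("CARGO_CFG_TARGET_POINTER_WIDTH").unwrap();
    let large_int = match width.as_str() {
        "16" => "u32",
        "32" => "u64",
        _ => "u128",
    };
    let out_dir = env::var("OUT_DIR").unwrap();
    let dest = Path::new(&out_dir).join("large_int.rs");
    fs::write(dest, format!("pub type LargeInt = {};\n", large_int)).unwrap();
}

// In the library: include!(concat!(env!("OUT_DIR"), "/large_int.rs"));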
A final obvious solution is to simply keep a map of native int sizes for each architecture. However, this solution is inelegant and doesn't scale well, so I'd like to avoid it.
So, my questions: is there a way to find the target's native int size, preferably before compilation, in order to reduce runtime overhead? Is this effort even worth it? That is, is there likely to be a significant difference between using the native int size as opposed to the pointer width?
It's generally hard (or impossible) to get compilers to emit optimal code for BigNum stuff; that's why https://gmplib.org/ has its low-level primitive functions (mpn_... docs) hand-written in assembly for various target architectures, with tuning for different micro-architectures. For example, https://gmplib.org/repo/gmp/file/tip/mpn/x86_64/core2/mul_basecase.asm handles the general case of multi-limb * multi-limb numbers, and https://gmplib.org/repo/gmp/file/tip/mpn/x86_64/coreisbr/aors_n.asm implements mpn_add_n and mpn_sub_n (Add OR Sub = aors), tuned for the Sandy Bridge family, which doesn't have partial-flag stalls so it can loop with dec/jnz.
Understanding what kind of asm is optimal may be helpful when writing code in a higher-level language, although in practice you can't even get close to it, so it sometimes makes sense to use a different technique, like only using values up to 2^30 in 32-bit integers (as CPython does internally, getting the carry-out via a right shift; see the section about Python in this). In Rust you do have access to overflowing_add to get the carry-out, but using it is still hard.
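To sketch that 2^30 technique in Rust (names and layout are mine, not CPython's actual code): each 32-bit limb holds only 30 significant bits, so limb addition can never wrap a u32 and the carry falls out of a plain shift rather than needing the CPU's carry flag.

const LIMB_BITS: u32 = 30;
const LIMB_MASK: u32 = (1 << LIMB_BITS) - 1;

// Add two equal-length, little-endian limb slices into `out`, returning the
// final carry. a[i] and b[i] are both < 2^30, so a[i] + b[i] + carry < 2^31
// and never overflows a u32.
fn add_limbs(a: &[u32], b: &[u32], out: &mut [u32]) -> u32 {
    let mut carry = 0;
    for i in 0..a.len() {
        let sum = a[i] + b[i] + carry;
        out[i] = sum & LIMB_MASK;   // low 30 bits are the result limb
        carry = sum >> LIMB_BITS;   // carry-out is just a right shift (0 or 1)
    }
    carry
}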
For practical use, writing Rust bindings for GMP is probably your best bet, unless that already exists.
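Bindings do exist (e.g. the gmp-mpfr-sys and rug crates wrap GMP). If you wanted to roll your own for just the mpn layer, a minimal sketch might look like the following; it assumes a 64-bit mp_limb_t and that gmp.h exports mpn_add_n under its usual __gmpn_-prefixed symbol name.

// Sketch of a hand-rolled binding to GMP's lowest-level add.
use std::os::raw::c_long;

#[link(name = "gmp")]
extern "C" {
    // mp_limb_t mpn_add_n(mp_limb_t *rp, const mp_limb_t *s1p,
    //                     const mp_limb_t *s2p, mp_size_t n);
    #[link_name = "__gmpn_add_n"]
    fn mpn_add_n(rp: *mut u64, s1p: *const u64, s2p: *const u64, n: c_long) -> u64;
}

// Returns the carry-out (0 or 1) after computing r = a + b over n limbs.
fn add_n(r: &mut [u64], a: &[u64], b: &[u64]) -> u64 {
    assert!(a.len() == b.len() && r.len() == a.len());
    unsafe { mpn_add_n(r.as_mut_ptr(), a.as_ptr(), b.as_ptr(), a.len() as c_long) }
}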
Using the largest chunks possible is very good; on all current CPUs, add reg64, reg64 has the same throughput and latency as add reg32, reg32 or the reg8 version. So you get twice as much work done per instruction, and carry propagation through 64 bits of result in 1 cycle of latency.
(There are alternate ways to store BigInteger data that can make SIMD useful; @Mysticial explains in Can long integer routines benefit from SSE?, e.g. 30 value bits per 32-bit int, allowing you to defer normalization until after a few addition steps. But every use of such numbers has to be aware of these issues, so it's not an easy drop-in replacement.)
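As a rough scalar illustration of that deferral (names and layout are mine, not Mysticial's code; the SIMD win comes from each lane being independent): with 2 spare bits per 32-bit limb, up to three normalized values can be accumulated before any limb can wrap, and then one pass restores the "every limb < 2^30" invariant.

const LIMB_BITS: u32 = 30;
const LIMB_MASK: u32 = (1 << LIMB_BITS) - 1;

// Lane-wise accumulate with no carry propagation; each limb stays below 2^32
// as long as at most 3 normalized values are added between normalize() calls.
fn add_deferred(acc: &mut [u32], b: &[u32]) {
    for (a, &x) in acc.iter_mut().zip(b) {
        *a += x;
    }
}

// Push the accumulated carries upward so every limb is < 2^30 again.
// The caller keeps a spare top limb so the final carry has somewhere to go.
fn normalize(acc: &mut [u32]) {
    let mut carry = 0;
    for limb in acc.iter_mut() {
        let sum = *limb + carry;
        *limb = sum & LIMB_MASK;
        carry = sum >> LIMB_BITS;
    }
    debug_assert_eq!(carry, 0);
}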
In Rust, you probably want to just use u64 regardless of the target, unless you really care about small-number (single-limb) performance on 32-bit targets. Let the compiler build u64 operations for you out of add / adc (add with carry).

The only thing that might need to be ISA-specific is if u128 is not available on some targets. You want to use 64 * 64 => 128-bit full multiply as your building block for multiplication; if the compiler can do that for you with u128 then that's great, especially if it inlines efficiently.
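For example, a sketch of the multiply-by-one-limb building block (names are mine): the u128 product is one widening multiply on targets that have it, and its high half becomes the carry into the next limb.

// out = a * factor, little-endian limbs, with out.len() == a.len() + 1.
fn mul_by_limb(out: &mut [u64], a: &[u64], factor: u64) {
    let mut carry = 0u64;
    for i in 0..a.len() {
        // 64 x 64 => 128-bit full multiply, plus the incoming carry;
        // the total can never exceed 128 bits.
        let wide = (a[i] as u128) * (factor as u128) + carry as u128;
        out[i] = wide as u64;          // low half is this result limb
        carry = (wide >> 64) as u64;   // high half carries into the next limb
    }
    out[a.len()] = carry;
}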
See also discussion in comments under the question.
One stumbling block for getting compilers to emit efficient BigInt addition loops (even inside the body of one unrolled loop) is writing an add that takes a carry input and produces a carry output. Note that x += 0xff..ff + carry=1 needs to produce a carry out even though 0xff..ff + 1 wraps to zero. So in C or Rust, x += y + carry has to check for carry out in both the y + carry and the x += parts.
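A sketch of that primitive with overflowing_add (the function name is mine; the unstable carrying_add method is the same idea): the two overflow checks can never both fire, because y + 1 only wraps when y is all-ones, and then adding the wrapped 0 to x cannot overflow again.

// Full adder for one 64-bit limb: carry-in and carry-out.
fn adc(x: u64, y: u64, carry_in: bool) -> (u64, bool) {
    let (t, c1) = y.overflowing_add(carry_in as u64);
    let (sum, c2) = x.overflowing_add(t);
    // c1 and c2 can't both be true, so `|` gives the correct carry-out.
    (sum, c1 | c2)
}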
It's really hard (probably impossible) to convince compiler back-ends like LLVM to emit a chain of adc instructions. An add/adc is doable when you don't need the carry-out from the adc, or probably if the compiler is doing it for you for u128.overflowing_add.

Often compilers will turn the carry flag into a 0 / 1 in a register instead of using adc. You can hopefully avoid that for at least pairs of u64 in addition by combining the input u64 values into u128 for u128.overflowing_add. That will hopefully not cost any asm instructions, because a u128 already has to be stored across two separate 64-bit registers, just like two separate u64 values.

So combining up to u128 could just be a local optimization for a function that adds arrays of u64 elements, to get the compiler to suck less.
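A sketch of that local trick (names are mine; note it handles the carry-out but not a carry-in, so it only halves the number of places where the compiler has to juggle the flag):

// Add two u64 limbs at a time by packing them into a u128. The packing is
// free in principle: a u128 already lives in two 64-bit registers.
fn add_pair(a_lo: u64, a_hi: u64, b_lo: u64, b_hi: u64) -> (u64, u64, bool) {
    let a = ((a_hi as u128) << 64) | (a_lo as u128);
    let b = ((b_hi as u128) << 64) | (b_lo as u128);
    let (sum, carry_out) = a.overflowing_add(b);
    (sum as u64, (sum >> 64) as u64, carry_out)
}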
In my library ibig what I do is:
- Pick the word size from target_arch for architectures I know about.
- Otherwise, pick it from target_pointer_width (16, 32 or 64).
- If target_pointer_width is not one of these values, use 64.
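For reference, the pointer-width part of that selection can be written with cfg attributes like this (just a sketch in the spirit of the list above, not ibig's actual source; Word is my name):

#[cfg(target_pointer_width = "16")]
pub type Word = u16;
#[cfg(target_pointer_width = "32")]
pub type Word = u32;
// Anything else (including 64) falls back to 64-bit words.
#[cfg(not(any(target_pointer_width = "16", target_pointer_width = "32")))]
pub type Word = u64;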