What is the fastest x86-64 assembly-language divide algorithm for huge numbers?

I am writing a code library in x86-64 assembly-language to provide all conventional bitwise, shift, logical, compare, arithmetic and math functions for s0128, s0256, s0512, s1024 signed-integer types and f0128, f0256, f0512, f1024 floating-point types. So far I'm working on the signed-integer types, because the floating-point functions will likely call some internal routines written for the integer types.

So far I've written and tested functions to perform the various unary operators, compare operators, and the add, subtract and multiply operators.

Now I'm trying to decide how to implement functions for the divide operators.

My first thought was, "Newton-Raphson must be the best approach". Why? Because it converges very quickly given a good seed (starting guess), and I should be able to execute the native 64-bit divide instruction on the leading 64-bit portions of the operands to get an excellent seed value. In fact, if the seed value is precise to 64 bits, getting the correct answers should only take (a sketch of the recurrence follows the list below):

`s0128` : 1~2 iterations : (or 1 iteration  plus 1~2 "test subtracts")
`s0256` : 2~3 iterations : (or 2 iterations plus 1~2 "test subtracts")
`s0512` : 3~4 iterations : (or 3 iterations plus 1~2 "test subtracts")
`s1024` : 4~5 iterations : (or 4 iterations plus 1~2 "test subtracts")
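
For reference, the Newton-Raphson reciprocal recurrence is x(k+1) = x(k) * (2 - d*x(k)), which roughly doubles the number of correct bits per step. Below is a minimal C sketch using double, only to show the recurrence itself; the real routine would iterate on fixed-point multi-limb values:

```c
#include <stdio.h>

/* Newton-Raphson reciprocal iteration: x(k+1) = x(k) * (2 - d * x(k)).
   Shown with double just to illustrate the quadratic convergence; a
   big-integer version iterates on fixed-point limbs, doubling the
   number of correct bits per step. */
int main(void)
{
    double d = 7.0;   /* divisor */
    double x = 0.1;   /* crude seed for 1/7 = 0.142857... */
    for (int k = 1; k <= 5; k++) {
        x = x * (2.0 - d * x);
        printf("iteration %d: x = %.17g\n", k, x);
    }
    return 0;
}
```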

However, a bit more thinking about this question makes me wonder. For example, I recall the core routine I wrote that performs the multiply operation for all the large integer types:

s0128 :   4 iterations ==   4 (128-bit = 64-bit * 64-bit) multiplies +  12 adds
s0256 :  16 iterations ==  16 (128-bit = 64-bit * 64-bit) multiplies +  48 adds
s0512 :  64 iterations ==  64 (128-bit = 64-bit * 64-bit) multiplies + 192 adds
s1024 : 256 iterations == 256 (128-bit = 64-bit * 64-bit) multiplies + 768 adds

The growth in operations for the wider data-types is quite substantial, even though the loop is fairly short and efficient (including cache-wise). This loop writes each 64-bit portion of the result only once, and never reads back any portion of the result for further processing.
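
For illustration, here is a C sketch of that column-wise (Comba-style) schoolbook loop, using the gcc/clang unsigned __int128 extension instead of assembly; the function name is made up for the sketch:

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Column-wise schoolbook multiply of two n-limb numbers into a 2n-limb
   result (limbs stored least-significant first).  Performs n*n
   64x64 -> 128-bit multiplies, and writes each result limb exactly once. */
void mul_nxn(const uint64_t *a, const uint64_t *b, uint64_t *r, int n)
{
    u128 acc = 0;           /* low 128 bits of the running column sum */
    uint64_t carry = 0;     /* overflows out of the 128-bit accumulator */
    for (int col = 0; col < 2 * n - 1; col++) {
        int lo = col < n ? 0 : col - n + 1;
        int hi = col < n ? col : n - 1;
        for (int i = lo; i <= hi; i++) {
            u128 p = (u128)a[i] * b[col - i];
            acc += p;
            if (acc < p) carry++;       /* accumulator wrapped past 2^128 */
        }
        r[col] = (uint64_t)acc;         /* emit one result limb, once */
        acc = (acc >> 64) | ((u128)carry << 64);
        carry = 0;
    }
    r[2 * n - 1] = (uint64_t)acc;       /* final high limb */
}
```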

This got me thinking about whether more conventional shift-and-subtract type divide algorithms might be faster, especially for the larger types.

My first thought was this:

quotient  = dividend / divisor
remainder = dividend - (quotient * divisor)

#1: To compute each bit, generally the divisor is subtracted from the dividend IF the divisor is less than or equal to the dividend. Well, usually we can determine that the divisor is definitely less-than or definitely greater-than the dividend by inspecting only their most-significant 64-bit portions. Only when those ms64-bit portions are equal must the routine check the next lower 64-bit portions, and only when those are equal must we check even lower, and so forth. Therefore, on almost every iteration (computing each bit of the quotient), we can greatly reduce the instructions executed to compute this test.
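
For illustration, the short-circuiting compare might look like this in C (a sketch; limbs stored least-significant first, and the name cmp_limbs is made up):

```c
#include <stdint.h>

/* Compare two n-limb unsigned numbers, most-significant limb first.
   Returns -1, 0, or +1.  Almost always decides on the first limb
   examined, which is the saving described above. */
int cmp_limbs(const uint64_t *a, const uint64_t *b, int n)
{
    for (int i = n - 1; i >= 0; i--) {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    }
    return 0;
}
```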

#2: However... on average, about 50% of the time we will find we need to subtract the divisor from the dividend, so we will need to subtract their entire widths anyway. In this case we actually executed more instructions than we would have in the conventional approach (where we first subtract them, then test the flags to determine whether the divisor <= dividend). Therefore, half the time we realize a saving, and half the time we realize a loss. On large types like s1024 (which contains -16- 64-bit components), savings are substantial and losses are small, so this approach should realize a large net savings. On tiny types like s0128 (which contains -2- 64-bit components), savings are tiny and losses significant but not huge.
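
Putting #1 and #2 together, a bit-at-a-time restoring divide might look like the following C sketch (limbs least-significant first; an assembly version would replace the inner loops with SHLD/SBB chains):

```c
#include <stdint.h>

/* Bit-at-a-time restoring division on n-limb unsigned numbers, the
   shift-and-subtract scheme discussed above.  A sketch only: the divisor
   must be nonzero, and quotient/remainder each hold n limbs. */
void div_restoring(const uint64_t *dividend, const uint64_t *divisor,
                   uint64_t *quotient, uint64_t *remainder, int n)
{
    for (int i = 0; i < n; i++) { quotient[i] = 0; remainder[i] = 0; }

    for (int bit = 64 * n - 1; bit >= 0; bit--) {
        /* remainder = (remainder << 1) | next dividend bit */
        uint64_t top = remainder[n - 1] >> 63;  /* bit shifted off the top */
        for (int i = n - 1; i > 0; i--)
            remainder[i] = (remainder[i] << 1) | (remainder[i - 1] >> 63);
        remainder[0] = (remainder[0] << 1)
                     | ((dividend[bit / 64] >> (bit % 64)) & 1);

        /* compare remainder with divisor, most-significant limb first;
           if a bit was shifted off the top, remainder > divisor for sure */
        int ge = (top != 0);
        if (!ge) {
            ge = 1;                              /* equal counts as >= */
            for (int i = n - 1; i >= 0; i--)
                if (remainder[i] != divisor[i]) {
                    ge = remainder[i] > divisor[i];
                    break;
                }
        }
        if (ge) {
            /* remainder -= divisor (wrapping subtract absorbs the top bit) */
            uint64_t borrow = 0;
            for (int i = 0; i < n; i++) {
                uint64_t d = divisor[i] + borrow;
                borrow = (d < borrow) | (remainder[i] < d);
                remainder[i] -= d;
            }
            quotient[bit / 64] |= 1ULL << (bit % 64);
        }
    }
}
```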


So, my question is, "what are the most efficient divide algorithms", given:

#1: modern x86-64 CPUs like FX-8350
#2: executing in 64-bit mode only (no 32-bit)
#3: implementation entirely in assembly-language
#4: 128-bit to 1024-bit integer operands (nominally signed, but...)

NOTE: My guess is, the actual implementation will operate only on unsigned integers. In the case of multiply, it turned out to be easier and more efficient (maybe) to convert negative operands to positive, then perform unsigned-multiply, then negate the result if exactly one original operand was negative. However, I'll consider a signed-integer algorithm if it is efficient.
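
For illustration, the wrapper scheme looks like this in C on native 64-bit values (a sketch with made-up names; a multi-limb version negates limb arrays the same way):

```c
#include <stdint.h>

/* Signed divide as a wrapper around an unsigned divide, the same scheme
   described above for multiply.  Truncating division: the remainder takes
   the sign of the dividend. */
void sdiv_via_udiv(int64_t a, int64_t b, int64_t *q, int64_t *r)
{
    uint64_t ua = a < 0 ? 0 - (uint64_t)a : (uint64_t)a;
    uint64_t ub = b < 0 ? 0 - (uint64_t)b : (uint64_t)b;
    uint64_t uq = ua / ub;
    uint64_t ur = ua % ub;
    /* quotient is negative iff exactly one operand was negative */
    *q = ((a < 0) != (b < 0)) ? -(int64_t)uq : (int64_t)uq;
    *r = (a < 0) ? -(int64_t)ur : (int64_t)ur;
}
```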

NOTE: If the best answers are different for my floating-point types (f0128, f0256, f0512, f1024), please explain why.

NOTE: My internal core unsigned-multiply routine, which performs the multiply operation for all these integer data-types, produces a double-width result. In other words:

u0256 = u0128 * u0128     // cannot overflow
u0512 = u0256 * u0256     // cannot overflow
u1024 = u0512 * u0512     // cannot overflow
u2048 = u1024 * u1024     // cannot overflow

My code library offers two versions of multiply for each signed-integer data-type:

s0128 = s0128 * s0128     // can overflow (result may not fit in s0128)
s0256 = s0256 * s0256     // can overflow (result may not fit in s0256)
s0512 = s0512 * s0512     // can overflow (result may not fit in s0512)
s1024 = s1024 * s1024     // can overflow (result may not fit in s1024)

s0256 = s0128 * s0128     // cannot overflow
s0512 = s0256 * s0256     // cannot overflow
s1024 = s0512 * s0512     // cannot overflow
s2048 = s1024 * s1024     // cannot overflow

This is consistent with the policy of my code library to "never lose precision" and "never overflow" (errors are returned when the answer is invalid due to precision-loss or due to overflow/underflow). However, when the double-width versions are called, no such errors can occur.

— asked by honestann, Dec 27 '13

3 Answers

Surely you know about the existing arbitrary-precision packages (e.g., http://gmplib.org/) and how they operate? They are generally designed to run "as fast as possible" for arbitrary precisions.

If you specialized them for fixed sizes (e.g., applying partial-evaluation techniques [manually] to fold constants and unroll loops), I'd expect you to get pretty good routines for the specific fixed-size precisions you want.

Also, if you haven't seen it, check out D. Knuth's Seminumerical Algorithms, an oldie but a real goodie, which provides the key algorithms for multi-precision arithmetic. (Most of the packages are based on these ideas, but Knuth has great explanations and gets an awful lot right.)

The key idea is to treat multi-precision numbers as if they were very-big-radix numbers (e.g., radix 2^64), and apply standard 3rd-grade arithmetic to the "digits" (e.g., 64-bit words). Divide consists of "estimate the quotient (big-radix) digit, multiply the estimate by the divisor, subtract from the dividend, shift left one digit, repeat" until you get enough digits to satisfy you. For division, yes, it's all unsigned (with sign handling done in wrappers). The basic trick is estimating a quotient digit well (using the single-precision instructions the processor provides you) and doing fast multi-precision multiplies by single digits. See Knuth for details. See technical research papers on multi-precision arithmetic (you get to do some research) for exotic ("fastest possible") improvements.
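
To give a concrete flavor of the quotient-digit estimate, here is a C sketch of Knuth's step D3 with base-2^64 digits, using the gcc/clang unsigned __int128 extension (on x86-64 the 128-by-64 divide it needs maps to the DIV instruction):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Estimate one base-2^64 quotient digit (Knuth TAOCP vol. 2, Algorithm D,
   step D3).  A sketch: assumes the divisor is normalized (top bit of v1
   set) and u2 <= v1, the Algorithm D invariant.  u2:u1:u0 are the three
   leading dividend limbs, v1:v0 the two leading divisor limbs. */
uint64_t estimate_qhat(uint64_t u2, uint64_t u1, uint64_t u0,
                       uint64_t v1, uint64_t v0)
{
    u128 num  = ((u128)u2 << 64) | u1;
    u128 qhat = num / v1;
    u128 rhat = num % v1;

    /* qhat is at most 2 too large; each correction decrements it once */
    while ((qhat >> 64) != 0 ||
           (u128)(uint64_t)qhat * v0 > ((rhat << 64) | u0)) {
        qhat--;
        rhat += v1;
        if ((rhat >> 64) != 0)   /* rhat >= 2^64: test is now surely false */
            break;
    }
    return (uint64_t)qhat;
}
```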

— answered by Ira Baxter


The "large-radix" approaches are more efficient for the kinds of huge data-types you mention, especially if you can execute 128-bit divided by 64-bit instructions in assembly-language.

While Newton-Raphson iteration does converge quickly, each iteration requires a huge number of multiply and accumulate steps.
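
For reference, that 128-bit-by-64-bit divide is a single DIV instruction on x86-64; a GNU-C inline-assembly wrapper might look like this sketch (the caller must guarantee hi < divisor, or DIV raises a divide-overflow exception):

```c
#include <stdint.h>

/* Divide the 128-bit value hi:lo by a 64-bit divisor using the x86-64
   DIV instruction (RDX:RAX / divisor -> quotient in RAX, remainder in
   RDX).  Requires hi < divisor, else DIV faults with #DE. */
static inline uint64_t div128by64(uint64_t hi, uint64_t lo,
                                  uint64_t divisor, uint64_t *rem)
{
    uint64_t q, r;
    __asm__("divq %4"
            : "=a"(q), "=d"(r)
            : "a"(lo), "d"(hi), "r"(divisor)
            : "cc");
    *rem = r;
    return q;
}
```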

— answered by sophia



For the multiplication, have a look here:

http://www.math.niu.edu/~rusin/known-math/99/karatsuba (archived copy: http://web.archive.org/web/20141114071302/http://www.math.niu.edu/~rusin/known-math/99/karatsuba)

Basically, it allows doing a 1024 x 1024 multiplication using three (instead of four) 512 x 512 bit multiplications. Or nine 256 x 256 bit, or twenty-seven 128 x 128 bit. The added complexity might not beat brute force even for 1024 x 1024, but probably will for bigger products. That's the simplest of the "fast" algorithms, using n ^ (log 3 / log 2) = n^1.585 multiplications.
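
To show the identity at a toy size, here is a C sketch that forms one 64 x 64 product from three 32 x 32 products; the same split applies at 512 x 512 and up (unsigned __int128 is a gcc/clang extension, used here only to hold and check the result):

```c
#include <stdint.h>
#include <stdio.h>

typedef unsigned __int128 u128;

/* Karatsuba at a toy size: one 64x64 product from three 32x32 products
   instead of four, via z1 = (a1+a0)*(b1+b0) - z2 - z0. */
u128 karatsuba64(uint64_t a, uint64_t b)
{
    uint64_t a1 = a >> 32, a0 = (uint32_t)a;
    uint64_t b1 = b >> 32, b0 = (uint32_t)b;
    uint64_t hi = a1 * b1;                       /* z2 */
    uint64_t lo = a0 * b0;                       /* z0 */
    u128 mid = (u128)(a1 + a0) * (b1 + b0) - hi - lo;  /* z1 */
    return ((u128)hi << 64) + (mid << 32) + lo;
}

int main(void)
{
    uint64_t a = 0x123456789abcdef0ULL, b = 0xfedcba9876543210ULL;
    printf("match: %d\n", karatsuba64(a, b) == (u128)a * b);
    return 0;
}
```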

I'd advise against writing everything in assembler. Implement the 64 x 64 -> 128 bit multiplication with inline assembler, and the same for add-with-carry (I think gcc and clang have built-in operations for this nowadays); then you can, for example, multiply n bits x 256 bits (any number of words times 4 words) with the independent operations scheduled in parallel, hiding the latency of multiplication, without going mad with assembler.
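
For example, those two primitives can be written in GNU C without any assembly at all; current gcc and clang compile these to single MUL and ADD/ADC instructions (a sketch using the unsigned __int128 extension rather than the carry built-ins):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* 64x64 -> 128 multiply: the compiler lowers this to one MUL on x86-64. */
static inline uint64_t mul64x64(uint64_t a, uint64_t b, uint64_t *hi)
{
    u128 p = (u128)a * b;
    *hi = (uint64_t)(p >> 64);
    return (uint64_t)p;
}

/* 64-bit add with carry in/out: lowered to ADD/ADC sequences. */
static inline uint64_t add64c(uint64_t a, uint64_t b,
                              uint64_t cin, uint64_t *cout)
{
    u128 s = (u128)a + b + cin;
    *cout = (uint64_t)(s >> 64);
    return (uint64_t)s;
}
```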

— answered by gnasher729