I need to do the following operation many times:
a, b
a * b mod p
, where p = 1000000007
and a, b
are of the same order of magnitude as p
My gut feeling is the naive
result = a * b
result %= p
is inefficient. Can I optimise multiplication modulo p
much like exponentiation modulo p
is optimised with pow(a, b, p)
?
You mention that "a, b
are of the same order of magnitude as p." Often in cryptography this means that a,b
are large numbers near p
, but strictly less-than p
.
If this is the case, then you could use the simple identity
to turn your calculation into
result = ((a-p)*(b-p))%p
You've then turned one large multiplication into two large subtractions and a small multiplication. You'll have to profile to see which is faster.
To do this calculation in assembly, but have it callable from Python, I'd try inline assembly from a Python module written in C. Both GCC and MSVC compilers feature inline assembly, only with differing syntax.
Note that our modulus p = 1000000007
just fits into 30-bits. The result
desired (a*b)%p
can be computed in Intel 80x86 registers given some weak
restrictions on a,b
not being much bigger than p
.
Restrictions on size of a,b
(1) a,b
are 32-bit unsigned integers
(2) a*b
is less than p << 32
, i.e. p
times 2^32
In particular if a,b
are each less than 2*p
, overflow will be avoided.
Given (1), it also suffices that either one of them is less than p
.
The Intel 80x86 instruction MUL can multiply two 32-bit unsigned integers and store the 64-bit result in accumulator register pair EDX:EAX. Some details and quirks of MUL are discussed in Section 10.2.1 of this helpful summary.
The instruction DIV can then divide this 64-bit result by a 32-bit constant
(the modulus p
), storing the quotient in EAX and the remainder in EDX.
See Section 10.2.2 of the last link. The result we want is that remainder.
It is this division instruction DIV that entails a risk of overflow, should the 64-bit product in numerator EDX:EAX give a quotient larger than 32-bits by failing to satisfy (2) above.
I'm working on a code snippet in C/inline assembly for "proof of concept".
However the maximum benefit in speed will depend on batching up arrays of
data a,b
to process, amortizing the overhead of function calls, etc. in
Python (if that is the target platform).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With