For my BigIntegers, in the PUREPASCAL implementation (i.e. no assembler allowed), I must multiply two UInt32
to get an UInt64
result.
The usual way to do this is to widen at least one of the operands, so you get a 64 bit multiplication:
Res := UInt64(A) * B;
where Res
is UInt64
and A
and B
are UInt32
.
But, In Win32, this produces a rather unwieldy piece of machine code:
MulTest.dpr.431: Res := UInt64(A) * B;
004DB463 8B45F8 mov eax,[ebp-$08] // load A
004DB466 33D2 xor edx,edx // make it UInt64
004DB468 52 push edx // push A
004DB469 50 push eax
004DB46A 8B45FC mov eax,[ebp-$04] // load B
004DB46D 33D2 xor edx,edx // make it UInt64
004DB46F E87C0AF3FF call @_llmul // 64 bit multiplication
004DB474 8945E8 mov [ebp-$18],eax // store 64 bit result
004DB477 8955EC mov [ebp-$14],edx
Now, if you just do:
Res := A * B;
you unfortunately get a 32 bit intermediate result (the top 32 bits of the actual result are simply zeroed out):
MulTest.dpr.435: Res := A * B;
004DB4BD 8B45FC mov eax,[ebp-$04]
004DB4C0 F76DF8 imul dword ptr [ebp-$08]
004DB4C3 33D2 xor edx,edx // zero out top 32 bits
004DB4C5 8945E8 mov [ebp-$18],eax
004DB4C8 8955EC mov [ebp-$14],edx
Now, if the line xor edx,edx
were not there, the result would be exactly what I need. This would be more than twice as fast (i.e. take less than half the time) as the widened version using the UInt64
cast.
Question: Does anyone know if there is a pseudo-function or trick or cast that does not discard the top 32 bits of the 64 bit result? I know how to do this in assembler, but this must be PUREPASCAL (it should work on other platforms too).
I managed to make 32 bit additions in PUREPASCAL much faster by accessing the array of 32 bit unisgned integers that makes up a BigInteger as an array of unsigned 16 bit integers and adding these instead. So I also tried multiplying using 16 bit intermediate results:
// Too slow: in a test, 2973 ms for Mul32(A, B) vs 1432 ms for UInt64(A) * B.
function MulU32ToU64(L, R: UInt32): UInt64; inline;
var
L0R0, L0R1, L1R0, L1R1, Sum: UInt32;
type
TUInt64 = packed record
case Byte of
0: (L0, L1, L2, L3: UInt16);
1: (I0, I1: UInt32);
end;
TUInt32 = packed record
Lo, Hi: Word;
end;
begin
L0R0 := TUInt32(L).Lo * TUInt32(R).Lo;
L0R1 := TUInt32(L).Lo * TUInt32(R).Hi;
L1R0 := TUInt32(L).Hi * TUInt32(R).Lo;
L1R1 := TUInt32(L).Hi * TUInt32(R).Hi;
TUInt64(Result).L0 := TUInt32(L0R0).Lo;
Sum := UInt32(TUInt32(L0R0).Hi) + TUInt32(L1R0).Lo + TUInt32(L0R1).Lo;
TUInt64(Result).L1 := TUInt32(Sum).Lo;
Sum := UInt32(TUInt32(Sum).Hi) + TUInt32(L1R0).Hi + TUInt32(L0R1).Hi + L1R1;
TUInt64(Result).I1 := Sum;
end;
It gives me the correct result, but is more than twice as slow as UInt64(A) * B
. That is no surprise, as it does 4 UInt32 multiplications and a lot of additions, which makes it slower than the code using System.__llmul
.
As @J... pointed out, Delphi generally uses IMUL
, which does a signed multiplication. So the multiplication of e.g. $00000002
and $FFFFFFFF
results in EAX = $FFFFFFFE
and EDX = $FFFFFFFF
(in other words, an Int64
with value -2
), while I would need EAX = $FFFFFFFE
(the same), but EDX = $00000001
(together a UInt64
with value $00000001FFFFFFFE
). So it is right that the top 32 bits are discarded, and there seems to be no way to coerce Delphi into using MUL
and keeping the top 32 bits of the result of that.
There is a Windows macro UInt32x32To64(a, b) to multiply two unsigned 32-bit values and get a 64-bit result.
If you need a pure pascal, you have to assign both of your 32-bit unsigned values to 64-bit unsigned values and then multiply them.
function UInt32x32To64(x, y: UInt32): UInt64;
var
xl, yl: UInt64;
begin
xl := x;
yl := y;
Result := xl * yl;
end;
Here is a code sample that checks this function. This code does also have an assembly function just for comparison. You don't need it because you have PurePascal, but it is implemented very efficiently - just one mul
instruction. It is a special form of mul
which takes just one argument, and the other comes from the eax
register. The resulting 64-bit value is stored in edx:eax. So it is implemented more efficiently than casting 32-bit values to 64-bit and then multiplyng them, since Delphi calls __llmul
for that from the System.pas, which does 3 mul
instructions, each of them is costly.
program TestMultiply;
{$APPTYPE CONSOLE}
function UInt32x32To64_asm(x, y: UInt32): UInt64; assembler;
asm
{$IFDEF WIN32}
mul edx
{$ELSE}
mov eax, ecx
mul edx
shl rdx, 32
or rax, rdx
{$ENDIF}
end;
function UInt32x32To64_purepascal(x, y: UInt32): UInt64;
var
xl, yl: UInt64;
begin
xl := x;
yl := y;
Result := xl * yl;
end;
var
aa, bb: UInt32;
cc, cc_test: UInt64;
begin
aa := 2147483647;
bb := 2148736590;
cc_test := 4614376688735543730;
cc := UInt32x32To64_asm(aa, bb);
if cc <> cc_test then
WriteLn('Error')
else
WriteLn('OK');
cc := UInt32x32To64_purepascal(aa, bb);
if cc <> cc_test then
WriteLn('Error')
else
WriteLn('OK');
end.
Here is a Python3 code to check:
aa = 2147483647
bb = 2148736590
cc = aa * bb
print(cc)
print("Error") if cc != 4614376688735543730 else print("OK!")
MulTest.dpr.435: Res := A * B;
004DB4BD 8B45FC mov eax,[ebp-$04]
004DB4C0 F76DF8 imul dword ptr [ebp-$08]
004DB4C3 33D2 xor edx,edx // zero out top 32 bits
004DB4C5 8945E8 mov [ebp-$18],eax
004DB4C8 8955EC mov [ebp-$14],edx
Now, if the line xor edx,edx were not there, the result would be exactly what I need.
No, it isn't what you want at all. That's a signed multiply and the result is nonsense if you want an unsigned result. Make A:=$FFFFFFFF
and B:=2
- the result of imul
is EAX = FFFFFFFE
and EDX = FFFFFFFF
. This opcode is emitted even with two unsigned operands. You want the mul
instruction, not imul
. I don't think the delphi compiler will ever emit mul
from pure pascal. From the documentation on *
(emphasis mine)
The value of x / y is of type Extended, regardless of the types of x and y. For other arithmetic operators, the result is of type Extended whenever at least one operand is a real; otherwise, the result is of type Int64 when at least one operand is of type Int64; otherwise, the result is of type Integer.
Integer - signed. Given how dependent this is on the idiosyncracies of the architecture, and given the deficiencies of the delphi compilers, I think the only performant solution here is going to be target-dependent assembly.
function UMul3264(x, y : UInt32) : UInt64;
asm
mul eax, edx
end;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With