In Delphi math.pas unit there is a procedure DivMod that i want to convert it into inline and optimize it for divisor to be always 10 . But I dont know details of Pentagon ASM . What is the conversion of bellow procedure <pre class="prettyprint"><code> procedure DivMod(Dividend: Integer; Divisor: Word; var Result, Remainder: Word); asm PUSH EBX MOV EBX,EDX MOV EDX,EAX SHR EDX,16 DIV BX MOV EBX,Remainder MOV [ECX],AX MOV [EBX],DX POP EBX end; </code></pre>

By far the most important optimization you can do is use a fixed-point multiplicative inverse for division by a compile-time constant: Why does GCC use multiplication by a strange number in implementing integer division?. Any decent C compiler will do that for you, but apparently Delphi won't, so there is a valid reason for doing it with asm. Can you return a value in EAX instead of storing both the quotient and remainder to memory? Seems like a waste to pass 2 pointer args, and force the caller to retrieve the value from memory. (Update, yes I think you can by making it a function instead of a procedure; I'm just blindly modifying Delphi code from other answers, though.) Anyway, fortunately we can get a C compiler to do the hard work of figuring out the multiplicative inverse and the shift counts for us. We can even get it to use the same "calling convention" that it looks like Delphi is using for inline asm. GCC's <code>regparm=3</code> 32-bit calling convention passes args in EAX, EDX, and ECX (in that order). You might want to make a separate version for cases where you only need the quotient, because (unlike the slow <code>div</code> instruction), you have to compute the remainder separately as <code>x - (x/y)*y</code> if you're using a fast multiplicative inverse. But yes that's still about twice to 4x as fast on modern x86. Or you could leave the remainder calculation to be done in pure Delphi, unless the compiler is just terrible at optimizing in general. <pre class="prettyprint lang-c prettyprint-override"><code>#ifdef _MSC_VER #define CONVENTION _fastcall // not the same, but 2 register args are better than none. #else #define CONVENTION __attribute__((regparm(3))) #endif // use gcc -Os to get it to emit code with actual div. divmod10(unsigned x, unsigned *quot, unsigned *rem) { unsigned tmp = x/10; // *quot = tmp; *rem = x%10; return tmp; } </code></pre> From the Godbolt compiler explorer: <pre class="prettyprint"><code># gcc8.2 -O3 -Wall -m32 div10: # simplified version without the remainder, returns in EAX mov edx, -858993459 # 0xCCCCCCCD mul edx # EDX:EAX = dividend * 0xCCCCCCCD mov eax, edx shr eax, 3 ret # quotient in EAX # returns quotient in EAX, stores remainder to [ECX] # quotient pointer in EDX is unused (and destroyed). divmod10: mov edx, -858993459 push ebx mov ebx, eax mul edx # EDX:EAX = dividend * 0xCCCCCCCD mov eax, edx shr eax, 3 # quotient in EAX = high_half(product) >> 3 = product >> (32+3) lea edx, [eax+eax*4] # EDX = quotient*5 add edx, edx # EDX = quot * 10 sub ebx, edx # remainder = dividend - quot*10 mov DWORD PTR [ecx], ebx # store remainder pop ebx ret # quotient in EAX </code></pre> This is C compiler output. Adapt as necessary to Delphi inline asm; the inputs are in the right registers for Delphi, I think. If Delphi inline-asm doesn't let you clobber EDX, you can save/restore it. Or you want to remove the unused <code>quotient</code> pointer input, then you can adjust the asm, or adjust the C on Godbolt and look at the new compiler output. This is more instructions than with <code>div</code>, but <code>div</code> is very slow (10 uops, and 26 cycle latency even on Skylake.) If you have a 64-bit integer type in Delphi, you can do this in Delphi source and avoid inline asm. Or as MBo shows, you can use <code>$CCCD</code> as a multiplicative inverse for inputs that are in the 0..2^16-1 range using only 32-bit integer types. For the remainder, the store/reload round trip (4 to 5 cycles) has similar latency to the actual calculation on a recent Intel CPU with mov-elimination (3 + 1 to quotient, + another 3 for the lea/add/sub = 7), so having to use inline asm for this is pretty crap. But it's still better than a <code>div</code> instruction for latency and throughput. See https://agner.org/optimize/ and other performance links in the x86 tag wiki. <hr> <h3>Delphi version you can copy/paste</h3> (If I got this right, I don't know Delphi, and just copied+modified examples here on SO and this site, based on what I infer about the calling-convention / syntax) I'm not sure I got the arg-passing right for inline-asm. This RADStudio documentation says "Except for ESP and EBP, an asm statement can assume nothing about register contents on entry to the statement." But I'm assuming args are in EAX and EDX. Using asm for 64-bit code might be silly, because in 64-bit you can efficiently use pure Pascal for 64-bit multiplication. How do I implement an efficient 32 bit DivMod in 64 bit code. So in the <code>{$IFDEF CPUX64}</code> blocks, the best choice might be pure pascal using <code>UInt64(3435973837)*num;</code> <pre class="prettyprint"><code>function Div10(Num: Cardinal): Cardinal; {$IFDEF PUREPASCAL} begin Result := Num div 10; end; {$ELSE !PUREPASCAL} {$IFDEF CPUX86} asm MOV EDX, $CCCCCCCD MUL EDX // EDX:EAX = Num * fixed-point inverse MOV EAX,EDX // mov then overwrite is ideal for Intel mov-elimination SHR EAX,3 end; {$ENDIF CPUX86} {$IFDEF CPUX64} asm // TODO: use pure pascal for this; Uint64 is efficient on x86-64 // Num in ECX, upper bits of RCX possibly contain garbage? mov eax, ecx // zero extend Num into RAX mov ecx, $CCCCCCCD // doesn't quite fit in a sign-extended 32-bit immediate for imul imul rax, rcx // RAX = Num * fixed-point inverse shr rax, 35 // quotient = eax end; {$ENDIF CPUX64} {$ENDIF} {Remainder is the function return value} function DivMod10(Num: Cardinal; var Quotient: Cardinal): Cardinal; {$IFDEF PUREPASCAL} begin Quotient := Num div 10; Result := Num mod 10; end; {$ELSE !PUREPASCAL} {$IFDEF CPUX86} asm // Num in EAX, @Quotient in EDX push esi mov ecx, edx // save @quotient mov edx, $CCCCCCCD mov esi, eax // save dividend for use in remainder calc mul edx // EDX:EAX = dividend * 0xCCCCCCCD shr edx, 3 // EDX = quotient mov [ecx], edx // store quotient into @quotient lea edx, [edx + 4*edx] // EDX = quot * 5 add edx, edx // EDX = quot * 10 mov eax, esi // off the critical path sub eax, edx // Num - (Num/10)*10 pop esi // Remainder in EAX = return value end; {$ENDIF CPUX86} {$IFDEF CPUX64} asm // TODO: use pure pascal for this? Uint64 is efficient on x86-64 // Num in ECX, @Quotient in RDX mov r8d, ecx // zero-extend Num into R8 mov eax, $CCCCCCCD imul rax, r8 shr rax, 35 // quotient in eax lea ecx, [rax + 4*rax] add ecx, ecx // ecx = 10*(Num/10) mov [rdx], eax // store quotient mov eax, r8d // copy Num again sub eax, ecx // remainder = Num - 10*(Num/10) // we could have saved 1 mov instruction by returning the quotient // and storing the remainder. But this balances latency better. end; {$ENDIF CPUX64} {$ENDIF} </code></pre> Storing the quotient and returning the remainder means that both might be ready at about the same time in the caller, because the extra latency of computing the remainder from the quotient overlaps with the store-forwarding. IDK if that's good, or if having out-of-order execution get started on some work based on the quotient is more often better. I'm going to guess that if you call DivMod10, you might only want the remainder. But in a split-into-decimal-digits loop that repeatedly divides by 10, it's the quotient that forms the critical path, so a version of this that returned the quotient and stored the remainder would be a much better choice there. In that case you'd make the quotient the return value in EAX, and rename the function arg to the remainder. <hr> The asm is based on clang output for this version of this C function (https://godbolt.org/z/qu2kvV), targeting the Windows x64 calling convention. But with some tweaks to make it more efficient, e.g. taking <code>mov</code> off the critical path, and using different registers to avoid REX prefixes. And replacing one LEA with just an ADD. <pre class="prettyprint lang-c prettyprint-override"><code>unsigned divmod10(unsigned x, unsigned *quot) { unsigned qtmp = x/10; unsigned rtmp = x%10; *quot = qtmp; //*rem = rtmp; return rtmp; } </code></pre> I used clang's version instead of gcc's because <code>imul r64,r64</code> is faster on Intel CPUs and Ryzen (3 cycle latency / 1 uop). <code>mul r32</code> is 3 uops, and only 1 per 2 clocks throughput on Sandybridge-family. I think the multiply hardware naturally produces a 128-bit result, and splitting the low 64 of that into edx:eax takes an extra uop, or something like that.

Following on from this answer, you can get some of the performance back in a 32-bit compile by leveraging a hardware 32x32 -> 64bit multiply using SSE: <pre class="prettyprint"><code>program Project1; {$APPTYPE CONSOLE} uses Windows, SysUtils; procedure DivMod10(num : Cardinal; var q, r : Cardinal); const m : cardinal = 3435973837; asm movd xmm0, m {move magic number to xmm0} movd xmm1, eax {move num to xmm1} pmuludq xmm0, xmm1 {xmm0[0:32] * xmm1[0:32] -> xmm0[0:64] unsigned} psrlq xmm0, 35 {right shift xmm0} movss [edx], xmm0 {store quotient to q} movd edx, xmm0 {recycle edx, store q} imul edx, -$A {edx = q * (-10)} add edx, eax {edx = r} mov [ecx], edx {store r} end; var q, r, t0, i : cardinal; begin t0 := GetTickCount; for I := 1 to 999999999 do DivMod10(i, q, r); WriteLn('SSE ASM : ' + IntToStr(GetTickCount - t0)); t0 := GetTickCount; for I := 1 to 999999999 do q := i div 10; WriteLn('div : ' + IntToStr(GetTickCount - t0)); WriteLn('Test correctness...'); for I := 1 to High(Cardinal) do begin DivMod10(i,q,r); if (q <> (i div 10)) or (r <> (i mod 10)) then WriteLn('Incorrect Result : ' + IntToStr(i)); end; WriteLn('Test complete.'); Readln; end. </code></pre> This produces : <blockquote> SSE ASM : 2449 div : 3401 Test correctness... Test complete. </blockquote> This is not generally safe since you should be checking at runtime if the CPU supports the required SSE instructions (and have a purepascal alternative in place for that case), it is nevertheless increasingly rare to find CPUs alive and working that are old enough to not support at least SSE2. For systems that do support it, this can be more performant than <code>div</code> (I see about a 25% performance benefit using <code>DivMod10</code> on Haswell, for example), and you get the remainder. Not as fast as a native 64-bit <code>IMUL</code>, but still quite useful. <hr> To address Peter's comments, consider the pure x86 version : <pre class="prettyprint"><code>procedure DivMod10(num : Cardinal; var q, r : Cardinal); const m : cardinal = 3435973837; asm push eax push edx mul m mov eax, edx shr eax, 3 pop edx mov [edx], eax pop eax imul edx, [edx], -$A add edx, eax mov [ecx], edx end; </code></pre> which produces (for me - Haswell i7) : <blockquote> x86 ASM : 2948 div : 3401 Test correctness... Test complete. </blockquote> Which is about 18% slower than the SSE version. <hr> With some good ideas from Peter, we can optimize the pure x86 version a bit further, saving a register by converting to a function and replacing the immediate <code>imul</code> with <code>lea</code> and <code>add</code> : <pre class="prettyprint"><code>function DivMod10(Num: Cardinal; var Quotient: Cardinal): Cardinal; const m : cardinal = 3435973837; asm mov ecx, eax {save num to ecx} push edx {save quotient pointer} mul m {edx:eax = m*Num} shr edx, 3 {edx = quotient} pop eax {restore quotient pointer} mov [eax], edx {store quotient} mov eax, ecx {restore num to eax} lea ecx, [edx +4*edx] {ecx = q*5} add ecx, ecx {ecx = q*10} sub eax, ecx {return remainder in eax} end; </code></pre> This gets execution time (same conditions as above) down to <code>2637ms</code>, but still not quite as quick as the SSE version. The <code>imul</code> to <code>lea</code> optimization is minor and optimizes latency over throughput - this can be applied to all algorithms depending on the end use environment.

How to optimize DivMod for a constant divisor of 10

Tags:

x86

micro-optimization

delphi

inline-assembly

In Delphi math.pas unit there is a procedure DivMod that i want to convert it into inline and optimize it for divisor to be always 10 . But I dont know details of Pentagon ASM . What is the conversion of bellow procedure

 procedure DivMod(Dividend: Integer; Divisor: Word;
  var Result, Remainder: Word);
asm
        PUSH    EBX
        MOV     EBX,EDX
        MOV     EDX,EAX
        SHR     EDX,16
        DIV     BX
        MOV     EBX,Remainder
        MOV     [ECX],AX
        MOV     [EBX],DX
        POP     EBX
end;

461

asked Oct 29 '18 04:10

Omid Motahed

2 Answers

By far the most important optimization you can do is use a fixed-point multiplicative inverse for division by a compile-time constant: Why does GCC use multiplication by a strange number in implementing integer division?.

Any decent C compiler will do that for you, but apparently Delphi won't, so there is a valid reason for doing it with asm.

Can you return a value in EAX instead of storing both the quotient and remainder to memory? Seems like a waste to pass 2 pointer args, and force the caller to retrieve the value from memory. (Update, yes I think you can by making it a function instead of a procedure; I'm just blindly modifying Delphi code from other answers, though.)

Anyway, fortunately we can get a C compiler to do the hard work of figuring out the multiplicative inverse and the shift counts for us. We can even get it to use the same "calling convention" that it looks like Delphi is using for inline asm. GCC's regparm=3 32-bit calling convention passes args in EAX, EDX, and ECX (in that order).

You might want to make a separate version for cases where you only need the quotient, because (unlike the slow div instruction), you have to compute the remainder separately as x - (x/y)*y if you're using a fast multiplicative inverse. But yes that's still about twice to 4x as fast on modern x86.

Or you could leave the remainder calculation to be done in pure Delphi, unless the compiler is just terrible at optimizing in general.

#ifdef _MSC_VER
#define CONVENTION  _fastcall   // not the same, but 2 register args are better than none.
#else
#define CONVENTION __attribute__((regparm(3)))
#endif

// use gcc -Os to get it to emit code with actual div.

divmod10(unsigned x, unsigned *quot, unsigned *rem) {
    unsigned tmp = x/10;
    // *quot = tmp;
    *rem = x%10;
    return tmp;
}

From the Godbolt compiler explorer:

# gcc8.2  -O3 -Wall -m32
div10:    # simplified version without the remainder, returns in EAX
        mov     edx, -858993459     # 0xCCCCCCCD
        mul     edx                 # EDX:EAX = dividend * 0xCCCCCCCD
        mov     eax, edx
        shr     eax, 3
        ret
      # quotient in EAX

# returns quotient in EAX, stores remainder to [ECX]
# quotient pointer in EDX is unused (and destroyed).
divmod10:
        mov     edx, -858993459
        push    ebx
        mov     ebx, eax
        mul     edx                      # EDX:EAX = dividend * 0xCCCCCCCD
        mov     eax, edx
        shr     eax, 3
        # quotient in EAX = high_half(product) >> 3 = product >> (32+3)
        lea     edx, [eax+eax*4]         # EDX = quotient*5
        add     edx, edx                 # EDX = quot * 10
        sub     ebx, edx                 # remainder = dividend - quot*10
        mov     DWORD PTR [ecx], ebx     # store remainder
        pop     ebx
        ret
        # quotient in EAX

This is C compiler output. Adapt as necessary to Delphi inline asm; the inputs are in the right registers for Delphi, I think.

If Delphi inline-asm doesn't let you clobber EDX, you can save/restore it. Or you want to remove the unused quotient pointer input, then you can adjust the asm, or adjust the C on Godbolt and look at the new compiler output.

This is more instructions than with div, but div is very slow (10 uops, and 26 cycle latency even on Skylake.)

If you have a 64-bit integer type in Delphi, you can do this in Delphi source and avoid inline asm. Or as MBo shows, you can use $CCCD as a multiplicative inverse for inputs that are in the 0..2^16-1 range using only 32-bit integer types.

For the remainder, the store/reload round trip (4 to 5 cycles) has similar latency to the actual calculation on a recent Intel CPU with mov-elimination (3 + 1 to quotient, + another 3 for the lea/add/sub = 7), so having to use inline asm for this is pretty crap. But it's still better than a div instruction for latency and throughput. See https://agner.org/optimize/ and other performance links in the x86 tag wiki.

Delphi version you can copy/paste

(If I got this right, I don't know Delphi, and just copied+modified examples here on SO and this site, based on what I infer about the calling-convention / syntax)

I'm not sure I got the arg-passing right for inline-asm. This RADStudio documentation says "Except for ESP and EBP, an asm statement can assume nothing about register contents on entry to the statement." But I'm assuming args are in EAX and EDX.

Using asm for 64-bit code might be silly, because in 64-bit you can efficiently use pure Pascal for 64-bit multiplication. How do I implement an efficient 32 bit DivMod in 64 bit code. So in the {$IFDEF CPUX64} blocks, the best choice might be pure pascal using UInt64(3435973837)*num;

function Div10(Num: Cardinal): Cardinal;
{$IFDEF PUREPASCAL}
begin
  Result := Num div 10;
end;
{$ELSE !PUREPASCAL}
{$IFDEF CPUX86}
asm
        MOV     EDX, $CCCCCCCD
        MUL     EDX                   // EDX:EAX = Num * fixed-point inverse
        MOV     EAX,EDX               // mov then overwrite is ideal for Intel mov-elimination
        SHR     EAX,3
end;
{$ENDIF CPUX86}
{$IFDEF CPUX64}
asm
         // TODO: use pure pascal for this; Uint64 is efficient on x86-64
        // Num in ECX, upper bits of RCX possibly contain garbage?
        mov     eax, ecx              // zero extend Num into RAX
        mov     ecx, $CCCCCCCD        // doesn't quite fit in a sign-extended 32-bit immediate for imul
        imul    rax, rcx              // RAX = Num * fixed-point inverse
        shr     rax, 35               // quotient = eax
end;
{$ENDIF CPUX64}
{$ENDIF}

 {Remainder is the function return value}
function DivMod10(Num: Cardinal; var Quotient: Cardinal): Cardinal;
{$IFDEF PUREPASCAL}
begin
  Quotient := Num div 10;
  Result := Num mod 10;
end;
{$ELSE !PUREPASCAL}
{$IFDEF CPUX86}
asm
    // Num in EAX,  @Quotient in EDX
    push    esi
    mov     ecx, edx           // save @quotient
    mov     edx, $CCCCCCCD
    mov     esi, eax           // save dividend for use in remainder calc
    mul     edx                // EDX:EAX = dividend * 0xCCCCCCCD
    shr     edx, 3             // EDX = quotient
    mov     [ecx], edx         // store quotient into @quotient

    lea     edx, [edx + 4*edx] // EDX = quot * 5
    add     edx, edx           // EDX = quot * 10
    mov     eax, esi                  // off the critical path
    sub     eax, edx           // Num - (Num/10)*10
    pop     esi
    // Remainder in EAX = return value
end;
{$ENDIF CPUX86}
{$IFDEF CPUX64}
asm
        // TODO: use pure pascal for this?  Uint64 is efficient on x86-64
    // Num in ECX,   @Quotient in RDX
    mov     r8d, ecx          // zero-extend Num into R8
    mov     eax, $CCCCCCCD
    imul    rax, r8
    shr     rax, 35           // quotient in eax

    lea     ecx, [rax + 4*rax]
    add     ecx, ecx          // ecx = 10*(Num/10)
    mov     [rdx], eax        // store quotient

    mov     eax, r8d          // copy Num again
    sub     eax, ecx          // remainder = Num - 10*(Num/10)
    // we could have saved 1 mov instruction by returning the quotient
    // and storing the remainder.  But this balances latency better.
end;
{$ENDIF CPUX64}
{$ENDIF}

Storing the quotient and returning the remainder means that both might be ready at about the same time in the caller, because the extra latency of computing the remainder from the quotient overlaps with the store-forwarding. IDK if that's good, or if having out-of-order execution get started on some work based on the quotient is more often better. I'm going to guess that if you call DivMod10, you might only want the remainder.

But in a split-into-decimal-digits loop that repeatedly divides by 10, it's the quotient that forms the critical path, so a version of this that returned the quotient and stored the remainder would be a much better choice there.

In that case you'd make the quotient the return value in EAX, and rename the function arg to the remainder.

The asm is based on clang output for this version of this C function (https://godbolt.org/z/qu2kvV), targeting the Windows x64 calling convention. But with some tweaks to make it more efficient, e.g. taking mov off the critical path, and using different registers to avoid REX prefixes. And replacing one LEA with just an ADD.

unsigned divmod10(unsigned x, unsigned *quot) {
    unsigned qtmp = x/10;
    unsigned rtmp = x%10;
     *quot = qtmp;
     //*rem = rtmp;
    return rtmp;
}

I used clang's version instead of gcc's because imul r64,r64 is faster on Intel CPUs and Ryzen (3 cycle latency / 1 uop). mul r32 is 3 uops, and only 1 per 2 clocks throughput on Sandybridge-family. I think the multiply hardware naturally produces a 128-bit result, and splitting the low 64 of that into edx:eax takes an extra uop, or something like that.

answered Oct 01 '22 20:10

Peter Cordes

Following on from this answer, you can get some of the performance back in a 32-bit compile by leveraging a hardware 32x32 -> 64bit multiply using SSE:

program Project1;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

procedure DivMod10(num : Cardinal; var q, r : Cardinal);
const
  m : cardinal = 3435973837;
asm
  movd xmm0, m         {move magic number to xmm0}
  movd xmm1, eax       {move num to xmm1}
  pmuludq xmm0, xmm1   {xmm0[0:32] * xmm1[0:32] -> xmm0[0:64] unsigned}
  psrlq xmm0, 35       {right shift xmm0}
  movss [edx], xmm0    {store quotient to q}
  movd edx, xmm0       {recycle edx, store q}
  imul edx, -$A        {edx = q * (-10)}
  add edx, eax         {edx = r}
  mov [ecx], edx       {store r}
end;

var
  q, r, t0, i : cardinal;
begin
  t0 := GetTickCount;
  for I := 1 to 999999999 do DivMod10(i, q, r);
  WriteLn('SSE ASM : ' + IntToStr(GetTickCount - t0));

  t0 := GetTickCount;
  for I := 1 to 999999999 do q := i div 10;
  WriteLn('div : ' + IntToStr(GetTickCount - t0));

  WriteLn('Test correctness...');
  for I := 1 to High(Cardinal) do begin
    DivMod10(i,q,r);
    if (q <> (i div 10)) or (r <> (i mod 10)) then
      WriteLn('Incorrect Result : ' + IntToStr(i));
  end;

  WriteLn('Test complete.');
  Readln;
end.

This produces :

SSE ASM : 2449
div : 3401
Test correctness...
Test complete.

This is not generally safe since you should be checking at runtime if the CPU supports the required SSE instructions (and have a purepascal alternative in place for that case), it is nevertheless increasingly rare to find CPUs alive and working that are old enough to not support at least SSE2.

For systems that do support it, this can be more performant than div (I see about a 25% performance benefit using DivMod10 on Haswell, for example), and you get the remainder. Not as fast as a native 64-bit IMUL, but still quite useful.

To address Peter's comments, consider the pure x86 version :

procedure DivMod10(num : Cardinal; var q, r : Cardinal);
const
  m : cardinal = 3435973837;
asm
  push eax
  push edx
  mul m
  mov eax, edx
  shr eax, 3
  pop edx
  mov [edx], eax
  pop eax
  imul edx, [edx], -$A
  add edx, eax
  mov [ecx], edx
end;

which produces (for me - Haswell i7) :

x86 ASM : 2948
div : 3401
Test correctness...
Test complete.

Which is about 18% slower than the SSE version.

With some good ideas from Peter, we can optimize the pure x86 version a bit further, saving a register by converting to a function and replacing the immediate imul with lea and add :

function DivMod10(Num: Cardinal; var Quotient: Cardinal): Cardinal;
const
  m : cardinal = 3435973837;
asm
  mov ecx, eax           {save num to ecx}
  push edx               {save quotient pointer}
  mul m                  {edx:eax = m*Num}
  shr edx, 3             {edx = quotient}
  pop eax                {restore quotient pointer}
  mov [eax], edx         {store quotient}
  mov eax, ecx           {restore num to eax}
  lea ecx, [edx +4*edx]  {ecx = q*5}
  add ecx, ecx           {ecx = q*10}
  sub eax, ecx           {return remainder in eax}
end;

This gets execution time (same conditions as above) down to 2637ms, but still not quite as quick as the SSE version. The imul to lea optimization is minor and optimizes latency over throughput - this can be applied to all algorithms depending on the end use environment.

answered Oct 01 '22 20:10

J...

Related questions
                            
                                Where is the sizing data of TOpenDialog stored and how can this be reset?
                            
                                Delphi Seattle has very big font as default
                            
                                How to concat multiple strings with Move?
                            
                                Is compiler directive WIN32 and CPUX86, WIN64 and CPUX64 interchangeable in Delphi?
                            
                                Delphi unexpected behavior of override
                            
                                CharInSet is much slower than IN, should I fix W1050 warning hint?
                            
                                Must I free a BSTR (WideString) allocated with SysAllocString?
                            
                                When iterating through controls on a form, how can I identify particular buttons?
                            
                                Change TToolBar background color
                            
                                Why form.show is called before form.create in firemonkey?
                            
                                FireDac get the count of rows affected by FDQuery.execsql command
                            
                                Delphi compile error F2048 Bad unit format
                            
                                Resource not found error when using TForm as base for a component
                            
                                Returning result from Windows callback in 64-bit XE6
                            
                                Does Delphi 7 support the inline keyword?
                            
                                How to copy all the pixels in a circle on a BITMAP image and paste to another spot on the image? [closed]
                            
                                Dynamically making indy use TLSv1.2
                            
                                Order by on ClientDataSet doesn't work
                            
                                Delphi cannot remove children from VertScrollBox
                            
                                Is there a compiler setting to control how floating point literals are typed in Delphi?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With