I was answering a question on Code Review and I discovered an interesting difference in performance (like, a lot) between x64 and x86.
class Program
{
static void Main(string[] args)
{
BenchmarkRunner.Run<ModVsOptimization>();
Console.ReadLine();
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
static public ulong Mersenne5(ulong dividend)
{
dividend = (dividend >> 32) + (dividend & 0xFFFFFFFF);
dividend = (dividend >> 16) + (dividend & 0xFFFF);
dividend = (dividend >> 8) + (dividend & 0xFF);
dividend = (dividend >> 4) + (dividend & 0xF);
dividend = (dividend >> 4) + (dividend & 0xF);
if (dividend > 14) { dividend = dividend - 15; } // mod 15
if (dividend > 10) { dividend = dividend - 10; }
if (dividend > 4) { dividend = dividend - 5; }
return dividend;
}
}
public class ModVsOptimization
{
[Benchmark(Baseline = true)]
public ulong RawModulo_5()
{
ulong r = 0;
for (ulong i = 0; i < 1000; i++)
{
r += i % 5;
}
return r;
}
[Benchmark]
public ulong OptimizedModulo_ViaMethod_5()
{
ulong r = 0;
for (ulong i = 0; i < 1000; i++)
{
r += Program.Mersenne5(i);
}
return r;
}
}
// * Summary *
BenchmarkDotNet=v0.10.8, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i7-5930K CPU 3.50GHz (Broadwell), ProcessorCount=12
Frequency=3415991 Hz, Resolution=292.7408 ns, Timer=TSC
[Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.7.2098.0
DefaultJob : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.7.2098.0
Method | Mean | Error | StdDev | Scaled |
---------------------------- |---------:|----------:|----------:|-------:|
RawModulo_5 | 4.601 us | 0.0121 us | 0.0107 us | 1.00 |
OptimizedModulo_ViaMethod_5 | 7.990 us | 0.0060 us | 0.0053 us | 1.74 |
// * Hints *
Outliers
ModVsOptimization.RawModulo_5: Default -> 1 outlier was removed
ModVsOptimization.OptimizedModulo_ViaMethod_5: Default -> 1 outlier was removed
// * Legends *
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
Scaled : Mean(CurrentBenchmark) / Mean(BaselineBenchmark)
1 us : 1 Microsecond (0.000001 sec)
// ***** BenchmarkRunner: End *****
// * Summary *
BenchmarkDotNet=v0.10.8, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i7-5930K CPU 3.50GHz (Broadwell), ProcessorCount=12
Frequency=3415991 Hz, Resolution=292.7408 ns, Timer=TSC
[Host] : Clr 4.0.30319.42000, 64bit RyuJIT-v4.7.2098.0
DefaultJob : Clr 4.0.30319.42000, 64bit RyuJIT-v4.7.2098.0
Method | Mean | Error | StdDev | Scaled |
---------------------------- |---------:|----------:|----------:|-------:|
RawModulo_5 | 8.323 us | 0.0042 us | 0.0039 us | 1.00 |
OptimizedModulo_ViaMethod_5 | 2.597 us | 0.0956 us | 0.0982 us | 0.31 |
// * Hints *
Outliers
ModVsOptimization.OptimizedModulo_ViaMethod_5: Default -> 2 outliers were removed
// * Legends *
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
Scaled : Mean(CurrentBenchmark) / Mean(BaselineBenchmark)
1 us : 1 Microsecond (0.000001 sec)
// ***** BenchmarkRunner: End *****
Now here's the part that gets interesting, though it doesn't necessarily surprise me given how I expect the C# compiler to work: both the x86 and x64 assemblies have the same IL for the RawModulo_5 method:
.method public hidebysig instance uint64
RawModulo_5() cil managed
{
.custom instance void [BenchmarkDotNet.Core]BenchmarkDotNet.Attributes.BenchmarkAttribute::.ctor() = ( 01 00 01 00 54 02 08 42 61 73 65 6C 69 6E 65 01 ) // ....T..Baseline.
// Code size 31 (0x1f)
.maxstack 3
.locals init ([0] uint64 r,
[1] uint64 i)
IL_0000: ldc.i4.0
IL_0001: conv.i8
IL_0002: stloc.0
IL_0003: ldc.i4.0
IL_0004: conv.i8
IL_0005: stloc.1
IL_0006: br.s IL_0014
IL_0008: ldloc.0
IL_0009: ldloc.1
IL_000a: ldc.i4.5
IL_000b: conv.i8
IL_000c: rem.un
IL_000d: add
IL_000e: stloc.0
IL_000f: ldloc.1
IL_0010: ldc.i4.1
IL_0011: conv.i8
IL_0012: add
IL_0013: stloc.1
IL_0014: ldloc.1
IL_0015: ldc.i4 0x3e8
IL_001a: conv.i8
IL_001b: blt.un.s IL_0008
IL_001d: ldloc.0
IL_001e: ret
} // end of method ModVsOptimization::RawModulo_5
Now I'm not sure where to look next, but I suspect the issue is somewhere in the JITter, though I tested on RyuJIT and LegacyJIT, both had the same general result with the x64 architecture (though LegacyJIT was slightly slower overall). These are run in Release mode outside of Visual Studio, so I'm assuming there's no attached debugging session to be causing it.
So I'm curious, what is causing this? I have no idea how to investigate further, but if anyone has any ideas on further investigation steps, feel free to comment and I'll gladly try to perform them.
I wanted to do an analysis of the generated assembly code to see what was going on. I grabbed your example code and ran it in Release mode. This is using Visual Studio 2015 with .NET Framework 4.5.2. CPU is an Intel Ivy Bridge i5-3570K, in case the JIT makes very specific optimizations. I ran the same test but without your benchmarking suite, just using a simple Stopwatch and dividing the time in ticks by the iteration count. Here is what I observed:
RawModulo_5, x86: 13721978 ticks, 13.721978 ticks per iteration
OptimizedModulo_ViaMethod_5, x86: 24641039 ticks, 24.641039 ticks per iteration
RawModulo_5, x64: 23275799 ticks, 23.275799 ticks per iteration
OptimizedModulo_ViaMethod_5, x64: 13389012 ticks, 13.389012 ticks per iteration
This is somewhat different from your measurements - on my machine the performance of the two methods more or less flips depending on x86 versus x64. Your measurements show much starker differences, particularly between each implementation and its other-arch counterpart. In your numbers RawModulo_5 is a little less than twice as slow in x64 (8.323 us versus 4.601 us), while OptimizedModulo_ViaMethod_5 is about 3.1x faster in x64 (2.597 us versus 7.990 us)!
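For reference, a Stopwatch measurement of the kind described above could look roughly like the sketch below. The actual harness is not shown in this answer, so TickTimer and TicksPerIteration are hypothetical names, and the warm-up call is an assumption added so that JIT compilation is not included in the timing.

```csharp
using System;
using System.Diagnostics;

public static class TickTimer
{
    // Hypothetical helper (not the answerer's actual code): calls 'method'
    // 'iterations' times and returns the average Stopwatch ticks per call.
    public static double TicksPerIteration(Func<ulong> method, int iterations)
    {
        // Assumption: one untimed warm-up call so the JIT has already
        // compiled the method before the clock starts.
        method();

        ulong sink = 0;
        var sw = Stopwatch.StartNew();
        for (int n = 0; n < iterations; n++)
        {
            // Accumulate the results so the calls cannot be discarded.
            sink += method();
        }
        sw.Stop();
        GC.KeepAlive(sink);

        return (double)sw.ElapsedTicks / iterations;
    }
}
```

It would be called as TickTimer.TicksPerIteration(new ModVsOptimization().RawModulo_5, 1000000). The totals above divided by the per-iteration figures work out to one million timed iterations, which is why that count is used here.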
Also, I hope you're not expecting the outputs of RawModulo_5 and OptimizedModulo_ViaMethod_5 to be equal, because they are not! A Mersenne5 implementation that agrees with i % 5 over the benchmark's range of 0 to 999 is below:
static public ulong Mersenne5(ulong dividend)
{
dividend = (dividend >> 32) + (dividend & 0xFFFFFFFF);
dividend = (dividend >> 16) + (dividend & 0xFFFF);
dividend = (dividend >> 8) + (dividend & 0xFF);
dividend = (dividend >> 4) + (dividend & 0xF);
// there was an extra shift by 4 here
if (dividend > 14) { dividend = dividend - 15; } // mod 15
// the 9 used to be a 10
if (dividend > 9) { dividend = dividend - 10; }
if (dividend > 4) { dividend = dividend - 5; }
return dividend;
}
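A quick way to see the difference is to compare both versions against the % operator over the range the benchmark uses. This is only a sketch: Mersenne5Check, Original, Corrected and FirstMismatch are names made up for the illustration, and it checks nothing beyond the 0 to 999 range, so it says nothing about larger dividends.

```csharp
using System;

public static class Mersenne5Check
{
    // The version from the question: two 4-bit folds and a "> 10" test.
    public static ulong Original(ulong dividend)
    {
        dividend = (dividend >> 32) + (dividend & 0xFFFFFFFF);
        dividend = (dividend >> 16) + (dividend & 0xFFFF);
        dividend = (dividend >> 8) + (dividend & 0xFF);
        dividend = (dividend >> 4) + (dividend & 0xF);
        dividend = (dividend >> 4) + (dividend & 0xF);
        if (dividend > 14) { dividend = dividend - 15; }
        if (dividend > 10) { dividend = dividend - 10; }
        if (dividend > 4) { dividend = dividend - 5; }
        return dividend;
    }

    // The version from this answer: one 4-bit fold and a "> 9" test.
    public static ulong Corrected(ulong dividend)
    {
        dividend = (dividend >> 32) + (dividend & 0xFFFFFFFF);
        dividend = (dividend >> 16) + (dividend & 0xFFFF);
        dividend = (dividend >> 8) + (dividend & 0xFF);
        dividend = (dividend >> 4) + (dividend & 0xF);
        if (dividend > 14) { dividend = dividend - 15; }
        if (dividend > 9) { dividend = dividend - 10; }
        if (dividend > 4) { dividend = dividend - 5; }
        return dividend;
    }

    // Returns the first i in [0, limit) where candidate(i) != i % 5,
    // or null when every value in the range agrees.
    public static ulong? FirstMismatch(Func<ulong, ulong> candidate, ulong limit)
    {
        for (ulong i = 0; i < limit; i++)
        {
            if (candidate(i) != i % 5) return i;
        }
        return null;
    }
}
```

FirstMismatch(Mersenne5Check.Corrected, 1000) finds no disagreement, while FirstMismatch(Mersenne5Check.Original, 1000) stops at 10, where the "> 10" test lets the value through and the final step turns it into 5 instead of 0.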
To gather the instructions on my system, I added a System.Diagnostics.Debugger.Break() within each method, just before the loops and the body of Mersenne5, so that I'd have a definite break point to grab the generated assembly. By the way, you can grab generated assembly code from the Visual Studio UI - if you're at a breakpoint you can right click the code editor window and select "Go To Disassembly" from the context menu. I've annotated the assembly to explain what it's doing. Sorry for the crazy syntax highlighting.
x86 RawModulo_5:
System.Diagnostics.Debugger.Break();
00242DA2 in al,dx
00242DA3 push edi
00242DA4 push ebx
00242DA5 sub esp,10h
00242DA8 call 6D4C0178
ulong r = 0;
00242DAD mov dword ptr [ebp-10h],0 ; setting the low and high dwords of 'r'
00242DB4 mov dword ptr [ebp-0Ch],0
for (ulong i = 0; i < 1000; i++)
; set the high dword of 'i' to 0
00242DBB mov dword ptr [ebp-14h],0
; clear the low dword of 'i' to 0 - the compiler is using 'edi' as the loop iteration var
00242DC2 xor edi,edi
{
r += i % 5;
00242DC4 mov eax,edi
00242DC6 mov edx,dword ptr [ebp-14h]
; edx:eax together are the high and low dwords of 'i', respectively
; this is a short circuit trick so it can avoid working with the high
; dword - you can see it jumps halfway in to the div/mod operation below
00242DC9 mov ecx,5
00242DCE cmp edx,ecx
00242DD0 jb 00242DDC
; 64 bit div/mod operation
00242DD2 mov ebx,eax
00242DD4 mov eax,edx
00242DD6 xor edx,edx
00242DD8 div eax,ecx
00242DDA mov eax,ebx
00242DDC div eax,ecx
00242DDE mov eax,edx
00242DE0 xor edx,edx
; load the current low and high dwords from 'r', then add into
; edx:eax as a pair forming a qword
00242DE2 add eax,dword ptr [ebp-10h]
00242DE5 adc edx,dword ptr [ebp-0Ch]
; store the result back in 'r'
00242DE8 mov dword ptr [ebp-10h],eax
00242DEB mov dword ptr [ebp-0Ch],edx
for (ulong i = 0; i < 1000; i++)
; load the loop variable low and high dwords into edx:eax
00242DEE mov eax,edi
00242DF0 mov edx,dword ptr [ebp-14h]
; increment eax (the low dword) and propagate any carries to
; edx (the high dword)
00242DF3 add eax,1
00242DF6 adc edx,0
; store the low and high dwords back to the high word of 'i' and
; the loop iteration counter, 'edi'
00242DF9 mov dword ptr [ebp-14h],edx
00242DFC mov edi,eax
; test the high dword
00242DFE cmp dword ptr [ebp-14h],0
00242E02 ja 00242E0E
00242E04 jb 00242DC4
; (int) i < 1000
00242E06 cmp edi,3E8h
00242E0C jb 00242DC4
}
return r;
; retrieve the current value of 'r' from memory, return value is
; in edx:eax since the return value is 64 bits
00242E0E mov eax,dword ptr [ebp-10h]
00242E11 mov edx,dword ptr [ebp-0Ch]
00242E14 lea esp,[ebp-8]
00242E17 pop ebx
00242E18 pop edi
00242E19 pop ebp
00242E1A ret
x86 OptimizedModulo_ViaMethod_5:
System.Diagnostics.Debugger.Break();
00242E33 push edi
00242E34 push esi
00242E35 push ebx
00242E36 sub esp,8
00242E39 call 6D4C0178
ulong r = 0;
; same as above, initialize 'r' to zero using low and high dwords
00242E3E mov dword ptr [ebp-10h],0
; this time we're using edi:esi as the loop counter, rather than
; edi and a memory location. probably less register pressure in this
; function, for reasons we'll see...
00242E45 xor ebx,ebx
for (ulong i = 0; i < 1000; i++)
; initialize 'i' to 0, esi is the loop counter low dword, edi is the high dword
00242E47 xor esi,esi
00242E49 xor edi,edi
; push 'i' to the stack, high word then low word
00242E4B push edi
00242E4C push esi
; call Mersenne5 - it got put in the data section since it's static
00242E4D call dword ptr ds:[3D7830h]
; return value comes back as edx:eax, where edx is the high dword
; ebx is the existing low dword of 'r', so it's accumulated into eax
00242E53 add eax,ebx
; the high dword of 'r' is at ebp-10, that gets accumulated to edx with
; the carry result of the last add since it's 64 bits wide
00242E55 adc edx,dword ptr [ebp-10h]
; store edx:ebx back to 'r'
00242E58 mov dword ptr [ebp-10h],edx
00242E5B mov ebx,eax
; increment the loop counter and carry to edi as well, 64 bit add
00242E5D add esi,1
00242E60 adc edi,0
; make sure edi == 0 since it's the high dword
00242E63 test edi,edi
00242E65 ja 00242E71
00242E67 jb 00242E4B
; (int) i < 1000
00242E69 cmp esi,3E8h
00242E6F jb 00242E4B
}
return r;
; move 'r' to edx:eax to return them
00242E71 mov eax,ebx
00242E73 mov edx,dword ptr [ebp-10h]
00242E76 lea esp,[ebp-0Ch]
00242E79 pop ebx
00242E7A pop esi
00242E7B pop edi
00242E7C pop ebp
00242E7D ret
x86 Mersenne5:
System.Diagnostics.Debugger.Break();
00342E92 in al,dx
00342E93 push edi
00342E94 push esi
; esi is the low dword, edi is the high dword of the 64 bit argument
00342E95 mov esi,dword ptr [ebp+8]
00342E98 mov edi,dword ptr [ebp+0Ch]
00342E9B call 6D4C0178
dividend = (dividend >> 32) + (dividend & 0xFFFFFFFF);
; this is a LOT of instructions for each step, but at least it's all registers.
; copy edi:esi to edx:eax
00342EA0 mov eax,esi
00342EA2 mov edx,edi
; clobber eax with edx, so now both are the high word. this is a
; shorthand for a 32 bit shift right of a 64 bit number.
00342EA4 mov eax,edx
; clear the high word now that we've moved the high word to the low word
00342EA6 xor edx,edx
; clear the high word of the original 'dividend', same as masking the low 32 bits
00342EA8 xor edi,edi
; (dividend >> 32) + (dividend & 0xFFFFFFFF)
; it's a 64 bit add, so it's the usual add/adc
00342EAA add eax,esi
00342EAC adc edx,edi
; 'dividend' now equals the temporary "variable" that held the addition result
00342EAE mov esi,eax
00342EB0 mov edi,edx
dividend = (dividend >> 16) + (dividend & 0xFFFF);
; same idea as above, but with an actual shift and mask since it's not 32 bits wide
00342EB2 mov eax,esi
00342EB4 mov edx,edi
00342EB6 shrd eax,edx,10h
00342EBA shr edx,10h
00342EBD and esi,0FFFFh
00342EC3 xor edi,edi
00342EC5 add eax,esi
00342EC7 adc edx,edi
00342EC9 mov esi,eax
00342ECB mov edi,edx
dividend = (dividend >> 8) + (dividend & 0xFF);
; same idea, keep going down...
00342ECD mov eax,esi
00342ECF mov edx,edi
00342ED1 shrd eax,edx,8
00342ED5 shr edx,8
00342ED8 and esi,0FFh
00342EDE xor edi,edi
00342EE0 add eax,esi
00342EE2 adc edx,edi
00342EE4 mov esi,eax
00342EE6 mov edi,edx
dividend = (dividend >> 4) + (dividend & 0xF);
00342EE8 mov eax,esi
00342EEA mov edx,edi
00342EEC shrd eax,edx,4
00342EF0 shr edx,4
00342EF3 and esi,0Fh
00342EF6 xor edi,edi
00342EF8 add eax,esi
00342EFA adc edx,edi
00342EFC mov esi,eax
00342EFE mov edi,edx
dividend = (dividend >> 4) + (dividend & 0xF);
00342F00 mov eax,esi
00342F02 mov edx,edi
00342F04 shrd eax,edx,4
00342F08 shr edx,4
00342F0B and esi,0Fh
00342F0E xor edi,edi
00342F10 add eax,esi
00342F12 adc edx,edi
00342F14 mov esi,eax
00342F16 mov edi,edx
if (dividend > 14) { dividend = dividend - 15; } // mod 15
; conditional subtraction
00342F18 test edi,edi
00342F1A ja 00342F23
00342F1C jb 00342F29
; 'dividend' > 14
00342F1E cmp esi,0Eh
00342F21 jbe 00342F29
; 'dividend' = 'dividend' - 15
00342F23 sub esi,0Fh
; subtraction borrow from high word
00342F26 sbb edi,0
if (dividend > 10) { dividend = dividend - 10; }
; same gist for the next two
00342F29 test edi,edi
00342F2B ja 00342F34
00342F2D jb 00342F3A
00342F2F cmp esi,0Ah
00342F32 jbe 00342F3A
00342F34 sub esi,0Ah
00342F37 sbb edi,0
if (dividend > 4) { dividend = dividend - 5; }
00342F3A test edi,edi
00342F3C ja 00342F45
00342F3E jb 00342F4B
00342F40 cmp esi,4
00342F43 jbe 00342F4B
00342F45 sub esi,5
00342F48 sbb edi,0
return dividend;
; move edi:esi into edx:eax for return
00342F4B mov eax,esi
00342F4D mov edx,edi
00342F4F pop esi
00342F50 pop edi
00342F51 pop ebp
00342F52 ret 8
The first big thing I notice is that Mersenne5 is not actually getting inlined, even though it's tagged as AggressiveInlining. I'm guessing this is because inlining the function inside OptimizedModulo_ViaMethod_5 would cause horrific register spilling, and the large amount of memory reads and writes would completely destroy the point of inlining the method, so the compiler elected (quite wisely!) not to do so.
Second, Mersenne5 is getting call'd 1000 times by OptimizedModulo_ViaMethod_5, so there are 1000 pieces of extra call/ret overhead being experienced, including the necessary pushes and pops to save register states across the call boundary. RawModulo_5 doesn't make any calls outside, and even the 64 bit division is optimized a bit so it skips the high dword where it can.
x64 RawModulo_5:
System.Diagnostics.Debugger.Break();
000007FE98C93CF0 sub rsp,28h
000007FE98C93CF4 call 000007FEF7B079C0
ulong r = 0;
; the compiler knows the high dword of rcx is already 0, so it just
; zeros the low dword. this is 'r'
000007FE98C93CF9 xor ecx,ecx
for (ulong i = 0; i < 1000; i++)
; same here, this is 'i'
000007FE98C93CFB xor r8d,r8d
{
r += i % 5;
; load 5 as a dword to the low dword of r9
000007FE98C93CFE mov r9d,5
; copy the loop counter to rax for the div below
000007FE98C93D04 mov rax,r8
; clear the lower dword of rdx, upper dword is clear already
000007FE98C93D07 xor edx,edx
; 64 bit div/mod in one instruction! but it's slow!
000007FE98C93D09 div rax,r9
; rax = quotient, rdx = remainder
; throw away the quotient since we're just doing mod, and accumulate the
; modulus into 'r'
000007FE98C93D0C add rcx,rdx
for (ulong i = 0; i < 1000; i++)
; 64 bit increment to the loop counter
000007FE98C93D0F inc r8
; i < 1000
000007FE98C93D12 cmp r8,3E8h
000007FE98C93D19 jb 000007FE98C93CFE
}
return r;
; return 'r' in rax, since we can directly return a 64 bit var in one register now
000007FE98C93D1B mov rax,rcx
000007FE98C93D1E add rsp,28h
000007FE98C93D22 ret
x64 OptimizedModulo_ViaMethod_5:
System.Diagnostics.Debugger.Break();
000007FE98C94040 push rdi
000007FE98C94041 push rsi
000007FE98C94042 sub rsp,28h
000007FE98C94046 call 000007FEF7B079C0
ulong r = 0;
; same general loop setup as above
000007FE98C9404B xor esi,esi
for (ulong i = 0; i < 1000; i++)
; 'edi' is the loop counter
000007FE98C9404D xor edi,edi
; put rdi in rcx, which is the x64 register used for the first argument
; in a call
000007FE98C9404F mov rcx,rdi
; call Mersenne5 - still no actual inlining!
000007FE98C94052 call 000007FE98C90F40
; accumulate 'r' with the return value of Mersenne5
000007FE98C94057 add rax,rsi
; store back to 'r' - I don't know why in the world the compiler did this
; seems like add rsi, rax would be better, but maybe there's a pipelining
; issue I'm not seeing.
000007FE98C9405A mov rsi,rax
; increment loop counter
000007FE98C9405D inc rdi
; i < 1000
000007FE98C94060 cmp rdi,3E8h
000007FE98C94067 jb 000007FE98C9404F
}
return r;
; put return value in rax like before
000007FE98C94069 mov rax,rsi
000007FE98C9406C add rsp,28h
000007FE98C94070 pop rsi
000007FE98C94071 pop rdi
000007FE98C94072 ret
x64 Mersenne5:
System.Diagnostics.Debugger.Break();
000007FE98C94580 push rsi
000007FE98C94581 sub rsp,20h
000007FE98C94585 mov rsi,rcx
000007FE98C94588 call 000007FEF7B079C0
dividend = (dividend >> 32) + (dividend & 0xFFFFFFFF);
; pretty similar to before actually, except this time we do a real
; shift and mask for the 32 bit part
000007FE98C9458D mov rax,rsi
; 'dividend' >> 32
000007FE98C94590 shr rax,20h
; hilariously, we have to load the mask into edx first. this is because
; there is no AND r/64, imm64 in x64
000007FE98C94594 mov edx,0FFFFFFFFh
000007FE98C94599 and rsi,rdx
; add the shift and the masked versions together
000007FE98C9459C add rax,rsi
000007FE98C9459F mov rsi,rax
dividend = (dividend >> 16) + (dividend & 0xFFFF);
; same logic continues down
000007FE98C945A2 mov rax,rsi
000007FE98C945A5 shr rax,10h
000007FE98C945A9 mov rdx,rsi
000007FE98C945AC and rdx,0FFFFh
000007FE98C945B3 add rax,rdx
; note the redundant moves that happen every time, rax into rsi, rsi
; into rax. so there's still not ideal x64 being generated.
000007FE98C945B6 mov rsi,rax
dividend = (dividend >> 8) + (dividend & 0xFF);
000007FE98C945B9 mov rax,rsi
000007FE98C945BC shr rax,8
000007FE98C945C0 mov rdx,rsi
000007FE98C945C3 and rdx,0FFh
000007FE98C945CA add rax,rdx
000007FE98C945CD mov rsi,rax
dividend = (dividend >> 4) + (dividend & 0xF);
000007FE98C945D0 mov rax,rsi
000007FE98C945D3 shr rax,4
000007FE98C945D7 mov rdx,rsi
000007FE98C945DA and rdx,0Fh
000007FE98C945DE add rax,rdx
000007FE98C945E1 mov rsi,rax
dividend = (dividend >> 4) + (dividend & 0xF);
000007FE98C945E4 mov rax,rsi
000007FE98C945E7 shr rax,4
000007FE98C945EB mov rdx,rsi
000007FE98C945EE and rdx,0Fh
000007FE98C945F2 add rax,rdx
000007FE98C945F5 mov rsi,rax
if (dividend > 14) { dividend = dividend - 15; } // mod 15
; notice the difference in jumping logic - the pairs of jumps are now singles
000007FE98C945F8 cmp rsi,0Eh
000007FE98C945FC jbe 000007FE98C94602
; using a single 64 bit add instead of a subtract, the immediate constant
; is the 2's complement of 15. this is okay because there's no borrowing
; to do since we can do the entire sub in one operation to one register.
000007FE98C945FE add rsi,0FFFFFFFFFFFFFFF1h
if (dividend > 10) { dividend = dividend - 10; }
000007FE98C94602 cmp rsi,0Ah
000007FE98C94606 jbe 000007FE98C9460C
000007FE98C94608 add rsi,0FFFFFFFFFFFFFFF6h
if (dividend > 4) { dividend = dividend - 5; }
000007FE98C9460C cmp rsi,4
000007FE98C94610 jbe 000007FE98C94616
000007FE98C94612 add rsi,0FFFFFFFFFFFFFFFBh
return dividend;
000007FE98C94616 mov rax,rsi
000007FE98C94619 add rsp,20h
000007FE98C9461D pop rsi
000007FE98C9461E ret
With the disassembly in hand, we can start to see why RawModulo_5 is nearly twice as slow in x64 compared to x86, and especially why OptimizedModulo_ViaMethod_5 is roughly three times faster under x64 than x86 in your measurements. To get a full explanation I think we'd need someone like Peter Cordes - he's far more knowledgeable than I am with regard to instruction timings and pipelining. Here are my intuitions as to where the advantages and disadvantages are coming from.
[x64 con] div in x86 versus x64 as it concerns RawModulo_5
According to the instruction tables provided by Agner Fog, on Broadwell a 32 bit div takes 10 micro-ops and has a latency of 22 to 29 clocks, while a 64 bit div takes 36 micro-ops and has a latency of 32 to 95 clocks.
The compiler also made an optimization in x86 RawModulo_5 that skips the high dword div on every iteration here, since the loop counter never gets past 999 and its high dword is always 0, so in reality it's just doing a single 32 bit div on each iteration. Thus, the 64 bit div latency is between 1.45 and 3.27 times higher than the 32 bit div latency. Both versions depend entirely on the result of the div, so the x64 code is paying a much larger performance penalty because of the higher latency. I would venture that the pair of add/adc instructions for 64 bit adds in x86 RawModulo_5 is a tiny penalty compared with the huge performance disadvantage of the 64 bit wide div.
[x64 pro] Reduced call overhead in x64 OptimizedModulo_ViaMethod_5
This is probably not a huge difference in terms of performance, but it's worth mentioning. Because OptimizedModulo_ViaMethod_5 is calling Mersenne5 1000 times in both versions, the 64 bit version is paying far less of a penalty in terms of the standard x86 versus x64 calling convention. Consider that the x86 version has to push two registers to the stack to pass a 64 bit variable, then Mersenne5 has to preserve esi and edi, then pull the high and low dwords out of the stack into edi and esi respectively. At the end, Mersenne5 has to restore esi and edi. In the x64 version, the value of i is passed in rcx directly, so no memory access is involved at all. The x64 Mersenne5 only saves and restores rsi; the other registers are clobbered.
[x64 pro] Many fewer instructions in x64 Mersenne5
Mersenne5 is more efficient in x64 as it can perform all the operations on the 64 bit dividend in single instructions, versus requiring pairs of instructions in x86 for the mov and add/adc operations. I have a hunch that the dependency chains are better in x64 as well, but I am not knowledgeable enough to speak on that subject.
[x64 pro] Better jump behavior in x64 Mersenne5
The three conditional subtractions that Mersenne5 does at the end are implemented much better under x64 than x86. On x86, each one has two comparisons and three possible conditional jumps that can be taken. On x64, there is only one comparison and one conditional jump, which is undoubtedly more efficient.
With those points in mind, it makes some sense that on Ivy Bridge we'd see the performance of the two methods flip-flop from x86 to x64. It's likely that the 64 bit division latency penalty (which is a little worse on Ivy Bridge than Broadwell, but not much) is hurting RawModulo_5 quite a bit, and the near halving of instructions in Mersenne5 is speeding up OptimizedModulo_ViaMethod_5 at the same time.
What doesn't make sense is the results on Broadwell - I'm still a little surprised how much faster the x64 OptimizedModulo_ViaMethod_5 is, even compared to the x86 RawModulo_5. I imagine the answer would be that micro-op fusion and pipelining for the Mersenne5 method are considerably better on x64, or perhaps the JIT on your architecture is using Broadwell-specific knowledge to output very different instructions.
I'm sorry I can't give a more conclusive answer, but I hope the analysis above is enlightening as to why there's a difference between the two methods and the two architectures.
By the way, if you want to see what a truly inlined version can do, here you go:
RawModulo_5, x86: 13722506 ticks, 13.722506 ticks per iteration
OptimizedModulo_ViaMethod_5, x86: 23640994 ticks, 23.640994 ticks per iteration
OptimizedModulo_TrueInlined, x86: 21488012 ticks, 21.488012 ticks per iteration
OptimizedModulo_TrueInlined2, x86: 21645697 ticks, 21.645697 ticks per iteration
RawModulo_5, x64: 22175326 ticks, 22.175326 ticks per iteration
OptimizedModulo_ViaMethod_5, x64: 12822574 ticks, 12.822574 ticks per iteration
OptimizedModulo_TrueInlined, x64: 7612328 ticks, 7.612328 ticks per iteration
OptimizedModulo_TrueInlined2, x64: 7591190 ticks, 7.59119 ticks per iteration
And the code:
public ulong OptimizedModulo_TrueInlined()
{
ulong r = 0;
ulong dividend = 0;
for (ulong i = 0; i < 1000; i++)
{
dividend = i;
dividend = (dividend >> 32) + (dividend & 0xFFFFFFFF);
dividend = (dividend >> 16) + (dividend & 0xFFFF);
dividend = (dividend >> 8) + (dividend & 0xFF);
dividend = (dividend >> 4) + (dividend & 0xF);
dividend = (dividend >> 4) + (dividend & 0xF);
if (dividend > 14) { dividend = dividend - 15; } // mod 15
if (dividend > 10) { dividend = dividend - 10; }
if (dividend > 4) { dividend = dividend - 5; }
r += dividend;
}
return r;
}
public ulong OptimizedModulo_TrueInlined2()
{
ulong r = 0;
ulong dividend = 0;
for (ulong i = 0; i < 1000; i++)
{
dividend = (i >> 32) + (i & 0xFFFFFFFF);
dividend = (dividend >> 16) + (dividend & 0xFFFF);
dividend = (dividend >> 8) + (dividend & 0xFF);
dividend = (dividend >> 4) + (dividend & 0xF);
dividend = (dividend >> 4) + (dividend & 0xF);
if (dividend > 14) { dividend = dividend - 15; } // mod 15
if (dividend > 10) { dividend = dividend - 10; }
if (dividend > 4) { dividend = dividend - 5; }
r += dividend;
}
return r;
}
r += i % 5;
This is the bottleneck statement in the code snippet, as explained well by @ozeanix. I'll annotate his extensive answer.
Division is one of the hard operations a processor has to perform; there is no known digital circuit that can execute division in a single cycle. It has to be implemented with an iterative approach, not fundamentally different from the way you learned to do it in elementary school. Execution time is roughly proportional to the number of bits, so a 64-bit division can be expected to be about twice as slow as a 32-bit division.
The x86 jitter, having to generate the cumbersome code to do the math with only 32-bit registers, took a shortcut for the case where the upper 32 bits of the ulong are 0. That turned out well in this specific case, since 999 and 5 are small enough. Do note how much faster the 64-bit code is on the Mersenne5() method: being able to use a single register to store intermediate values and a single shift instruction to move 64 bits at a time gives it a big leg up.
The x64 jitter cannot use the same trick the x86 jitter uses, not without making the code slower, because the upper 32 bits of a 64-bit register are not directly addressable. That does not mean that you are stuck with the slower perf; with sufficient thrust any pig can be made to fly. I'll show a coding trick that I reverse-engineered from a C compiler optimizer. It works in this specific case because you repeatedly use the same divisor. Just to illustrate the trick, this is the machine code that such a compiler generates in its inner loop with loop unrolling and instruction mixing removed:
00007FF603121006 mov rax,0CCCCCCCCCCCCCCCDh ; magic!
00007FF603121010 mul rax,r9 ; magic * i
00007FF603121013 shr rdx,2 ; rdx = (magic * i) / 4 / 2^64
00007FF603121017 lea rcx,[rdx+rdx*4] ; 5 * rdx
00007FF60312101B mov rdx,r9 ; i
00007FF60312101E sub rdx,rcx ; i - 5 * ((magic * i) / 4 / 2^64)
00007FF603121024 add r8,rdx ; r += i % 5
This is, cough, hard to make sense of. The key point is that the code does not use the DIV instruction at all, but gets by with MUL and SHR, which makes it very fast. SHR is the exact equivalent of the >> operator in C#; right-shifting is equivalent to dividing by powers of 2.
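To convince yourself that the instruction sequence above really computes i % 5, the same arithmetic can be replayed in C#. This is a sketch for checking the math, not for speed: it uses System.Numerics.BigInteger to stand in for the 128-bit product that MUL leaves in rdx:rax, and MagicDivideBy5, HighMultiply and Mod5 are names invented for the illustration.

```csharp
using System;
using System.Numerics;

public static class MagicDivideBy5
{
    // The constant loaded into rax in the listing above.
    const ulong Magic = 0xCCCCCCCCCCCCCCCDUL;

    // Upper 64 bits of the 128-bit product a * b, i.e. what MUL leaves in rdx.
    // BigInteger is used only to make the wide multiply easy to express.
    public static ulong HighMultiply(ulong a, ulong b)
    {
        return (ulong)(((BigInteger)a * b) >> 64);
    }

    // Mirrors the listing: mul, shr rdx,2, lea (times 5), sub.
    public static ulong Mod5(ulong i)
    {
        ulong quotient = HighMultiply(Magic, i) >> 2; // i / 5
        return i - 5 * quotient;                      // i - 5 * (i / 5)
    }
}
```

Comparing MagicDivideBy5.Mod5(i) with i % 5 in a loop is an easy way to check the constant against the % operator.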
The big trick is to transform a division by 5 into a division by a power of 2. This is not in general possible, but it can be approximated. It takes some rewriting tricks to see that. It starts with the identity that transforms modulo into division:
A % B == A - B * (A / B)
Transform the division by multiplying the left and right side by N/B where N is a convenient power of 2:
A % B == A - B * ((A * N / B) / N)
Since N / B is known up front it can be hoisted out of the loop. I should emphasize that this identity is only valid for floating point division. We want to use integer division instead. Thus:
A % B ~= A - B * (A * K / N) where K ~= N / B
The approximation for K is the more accurate the larger a value we pick for N. The C compiler code uses a very large value for N, 4 * 2^64, taking advantage of a 64-bit multiplication producing a 128-bit result. Something we cannot do in C#, we have to pick a value for N that is small enough so the result never overflows. Encoding this approach in a helper class:
public class FastModulo {
public FastModulo(ulong maxdividend, ulong divisor) {
div = divisor;
int dividendbits = 1 + (int)(Math.Log(maxdividend - 1) / Math.Log(2));
shift = 64 - dividendbits;
mult = (ulong)Math.Round((double)(1UL << shift) / divisor);
//TODO: verify that the approximation is accurate enough.
}
public ulong Modulo(ulong value) {
return value - (div * ((value * mult) >> shift));
}
int shift;
ulong mult, div;
}
And using it:
public ulong RawModulo_5() {
var fm = new FastModulo(1000, 5);
ulong r = 0;
for (uint i = 0; i < 1000; i++) {
r += fm.Modulo(i);
}
return r;
}
Or the less readable:
r += i - (5 * ((i * 3602879701896397UL) >> 54));
It is quite a bit faster in 64-bit mode (don't use it in 32); I see a rough x3 improvement on my mobile Haswell. This is achieved by replacing the expensive multi-cycle division with 2 multiplies, a shift and a subtraction, each of which takes only a few cycles at most.
There is a //TODO, it needs a check to verify that the approximation does not cause errors when the dividend or divisor get too large. Not 100% sure how to do this correctly, modular math gives me a headache. But I'm sure most programmers consider this a curiosity instead of practical code :) If somebody wants to dig in then please edit the code to add the check, otherwise just run the code both ways to verify that the result is the same.
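As a starting point for "run the code both ways", here is a minimal brute-force check, assuming the FastModulo class exactly as written above (repeated so the snippet compiles on its own). FastModuloCheck and FirstMismatch are hypothetical names; the check simply walks every dividend below maxdividend, so it is only practical for small ranges and is not a proof.

```csharp
using System;

// Same helper as above, repeated so this snippet is self-contained.
public class FastModulo
{
    readonly int shift;
    readonly ulong mult, div;

    public FastModulo(ulong maxdividend, ulong divisor)
    {
        div = divisor;
        int dividendbits = 1 + (int)(Math.Log(maxdividend - 1) / Math.Log(2));
        shift = 64 - dividendbits;
        mult = (ulong)Math.Round((double)(1UL << shift) / divisor);
    }

    public ulong Modulo(ulong value)
    {
        return value - (div * ((value * mult) >> shift));
    }
}

public static class FastModuloCheck
{
    // Returns the first dividend below maxdividend for which FastModulo
    // disagrees with the % operator, or null if the whole range agrees.
    public static ulong? FirstMismatch(ulong maxdividend, ulong divisor)
    {
        var fm = new FastModulo(maxdividend, divisor);
        for (ulong value = 0; value < maxdividend; value++)
        {
            if (fm.Modulo(value) != value % divisor) return value;
        }
        return null;
    }
}
```

FastModuloCheck.FirstMismatch(1000, 5) returns null, so the benchmark case is fine. The same check with a divisor of 3 reports a mismatch at 3, because Math.Round rounds the multiplier down there, which suggests the //TODO is worth taking seriously before reusing the class with other divisors.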