For those who are interested in how I do the benchmark, look here; I simply replace/add a couple of methods near the built-in "Loop 1K" method.
Sorry, I forgot to mention my testing environment: .NET 4.5, x64 (with "Prefer 32-bit" unchecked). In x86, both methods take about 5x as much time.
Loop2 takes 3x as much time as Loop. I thought that x++ / x += y should not slow down as x gets larger (since it takes 1 or 2 CPU instructions either way). Is it due to locality of reference? However, I thought that in Loop2 there are not many variables, so they should all be close to each other...
public long Loop(long testSize)
{
    long ret = 0;
    for (long i = 0; i < testSize; i++)
    {
        long p = 0;
        for (int j = 0; j < 1000; j++)
        {
            p += 10;
        }
        ret += p;
    }
    return ret;
}
public long Loop2(long testSize)
{
    long ret = 0;
    for (long i = 0; i < testSize; i++)
    {
        for (int j = 0; j < 1000; j++)
        {
            ret += 10;
        }
    }
    return ret;
}
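The harness behind the "look here" link isn't reproduced in this post; as a rough sketch, a timing run along these lines shows the general shape of such a measurement (the class name, test size, and output format are my own, not the original harness):

using System;
using System.Diagnostics;

class LoopBenchmark
{
    // Loop and Loop2 are the methods from the question, made static here.
    static long Loop(long testSize)
    {
        long ret = 0;
        for (long i = 0; i < testSize; i++)
        {
            long p = 0;
            for (int j = 0; j < 1000; j++) p += 10;
            ret += p;
        }
        return ret;
    }

    static long Loop2(long testSize)
    {
        long ret = 0;
        for (long i = 0; i < testSize; i++)
            for (int j = 0; j < 1000; j++) ret += 10;
        return ret;
    }

    static void Main()
    {
        const long testSize = 10 * 1000 * 1000; // hypothetical size, my choice

        // Warm up so the JIT compiles both methods before timing.
        Loop(10);
        Loop2(10);

        var sw = Stopwatch.StartNew();
        long r1 = Loop(testSize);
        sw.Stop();
        Console.WriteLine("Loop : {0} (ret={1})", sw.Elapsed, r1);

        sw = Stopwatch.StartNew();
        long r2 = Loop2(testSize);
        sw.Stop();
        Console.WriteLine("Loop2: {0} (ret={1})", sw.Elapsed, r2);
    }
}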
Update: When, if ever, is loop unrolling still useful? is a useful related question.
It's been said a few times before that the x86 JIT does a better job than the x64 JIT when it comes to optimization, and it looks like that is what is happening in this case. Although the loops perform essentially the same work, the x64 assembly code that the JITer creates is fundamentally different for the two, and I think that accounts for the speed difference you're seeing.
The assembly code for the two methods differs in the critical inner loop, which runs 1000*N times. This is what I think accounts for the speed difference.
Loop 1:
000007fe`97d50240 4d8bd1         mov     r10,r9
000007fe`97d50243 4983c128       add     r9,28h
000007fe`97d50247 4183c004       add     r8d,4
; Loop while j < 1000d
000007fe`97d5024b 4181f8e8030000 cmp     r8d,3E8h
000007fe`97d50252 7cec           jl      000007fe`97d50240
Loop 2:
; rax = ret
; ecx = j
; Add 10 to ret 4 times
000007fe`97d50292 48050a000000   add     rax,0Ah
000007fe`97d50298 48050a000000   add     rax,0Ah
000007fe`97d5029e 48050a000000   add     rax,0Ah
000007fe`97d502a4 48050a000000   add     rax,0Ah
000007fe`97d502aa 83c104         add     ecx,4   ; increment j by 4
; Loop while j < 1000d
000007fe`97d502ad 81f9e8030000   cmp     ecx,3E8h
000007fe`97d502b3 7cdd           jl      000007fe`97d50292
You'll notice that the JIT unrolls the inner loop in both cases, but the code inside the loop differs greatly in the number of instructions executed. Loop 1 is optimized down to a single add of 40 (28h) per unrolled pass of four iterations, whereas Loop 2 performs four separate adds of 10.
My (wild) guess is that the JITer can better optimize the variable p because it is defined in the inner scope of the first loop. Since it can detect that p is never used outside of that loop and is truly temporary, it can apply different optimizations. In the second loop, you're acting on a variable that is defined and used outside the scope of both loops, and the optimization rules used in the x64 JIT don't recognize it as code that is eligible for the same optimizations.
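To make the difference concrete, here is my reconstruction (not actual JIT output) of what the unrolled inner loop of Loop effectively computes, based on the disassembly above:

// Four source iterations of p += 10 collapsed into one add of 40
// (the "add r9,28h" in the disassembly above).
public long InnerLoopAsJitted()
{
    long p = 0;
    for (int j = 0; j < 1000; j += 4)
    {
        p += 40;
    }
    return p;
}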
I am not seeing any appreciable difference in performance. Using this LinqPad script (and including those two methods of yours):
void Main()
{
    // Warm up the VM
    Loop(10);
    Loop2(10);

    var stopwatch = Stopwatch.StartNew();
    Loop(10 * 1000 * 1000);
    stopwatch.Stop();
    stopwatch.Elapsed.Dump();

    stopwatch = Stopwatch.StartNew();
    Loop2(10 * 1000 * 1000);
    stopwatch.Stop();
    stopwatch.Elapsed.Dump();
}
Prints out (in LinqPad):
00:00:22.7749976
00:00:22.6971114
When reversing the order of the Loop / Loop2 calls, the results are similar:
00:00:22.7572688
00:00:22.6758102
This seems to indicate that the performance is the same. Perhaps you didn't warm up the VM?
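As an aside, a library like BenchmarkDotNet takes care of warm-up, multiple iterations, and statistical reporting automatically; a minimal sketch of the same comparison (the fixed TestSize is my own choice, and the method bodies are copied from the question):

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class LoopBenchmarks
{
    private const long TestSize = 100000; // hypothetical size for quick runs

    [Benchmark]
    public long Loop()
    {
        long ret = 0;
        for (long i = 0; i < TestSize; i++)
        {
            long p = 0;
            for (int j = 0; j < 1000; j++) p += 10;
            ret += p;
        }
        return ret;
    }

    [Benchmark]
    public long Loop2()
    {
        long ret = 0;
        for (long i = 0; i < TestSize; i++)
            for (int j = 0; j < 1000; j++) ret += 10;
        return ret;
    }
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<LoopBenchmarks>();
}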
Loop should be faster than Loop2. The only explanation that comes to my mind is that compiler optimization kicks in and reduces

long p = 0;
for (int j = 0; j < 1000; j++)
{
    p++;
}

to something like long p = 1000;. Checking the generated assembler code would bring clarity.
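The question's inner loop actually does p += 10, so the folded constant would be 10000 per outer iteration. A minimal sketch of the reduced method under that hypothesis (my illustration, not verified against the actual JIT output):

public long LoopFolded(long testSize)
{
    long ret = 0;
    for (long i = 0; i < testSize; i++)
    {
        // Hypothesized: 1000 iterations of p += 10 folded into one constant add.
        ret += 10000;
    }
    return ret;
}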