For those who are interested in how I do the benchmark, look here; I simply replace/add a couple of methods near the built-in "Loop 1K" method.
Sorry, I forgot to mention my testing environment: .NET 4.5, x64 (with "Prefer 32-bit" unchecked). In x86, both methods take about 5x as much time.
Loop2 takes 3x as much time as Loop. I thought that x++ / x += y should not slow down as x gets larger (since it takes 1 or 2 CPU instructions either way). Is it due to locality of reference? However, I thought that in Loop2 there are not many variables, so they should all be close to each other...
public long Loop(long testSize)
{
    long ret = 0;
    for (long i = 0; i < testSize; i++)
    {
        long p = 0;
        for (int j = 0; j < 1000; j++)
        {
            p += 10;
        }
        ret += p;
    }
    return ret;
}
public long Loop2(long testSize)
{
    long ret = 0;
    for (long i = 0; i < testSize; i++)
    {
        for (int j = 0; j < 1000; j++)
        {
            ret += 10;
        }
    }
    return ret;
}
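The harness behind the "look here" link isn't reproduced in this post; as a rough sketch, a timing run along these lines shows the general shape of such a measurement (the class name, test size, and output format are my own, not the original harness):

using System;
using System.Diagnostics;

class LoopBenchmark
{
    // Loop and Loop2 are the methods from the question, made static here.
    static long Loop(long testSize)
    {
        long ret = 0;
        for (long i = 0; i < testSize; i++)
        {
            long p = 0;
            for (int j = 0; j < 1000; j++) p += 10;
            ret += p;
        }
        return ret;
    }

    static long Loop2(long testSize)
    {
        long ret = 0;
        for (long i = 0; i < testSize; i++)
            for (int j = 0; j < 1000; j++) ret += 10;
        return ret;
    }

    static void Main()
    {
        const long testSize = 10 * 1000 * 1000; // hypothetical size, my choice

        // Warm up so the JIT compiles both methods before timing.
        Loop(10);
        Loop2(10);

        var sw = Stopwatch.StartNew();
        long r1 = Loop(testSize);
        sw.Stop();
        Console.WriteLine("Loop : {0} (ret={1})", sw.Elapsed, r1);

        sw = Stopwatch.StartNew();
        long r2 = Loop2(testSize);
        sw.Stop();
        Console.WriteLine("Loop2: {0} (ret={1})", sw.Elapsed, r2);
    }
}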
Update: When, if ever, is loop unrolling still useful? is a useful related question.
It's been said a few times before that the x86 JIT does a better job than the x64 JIT when it comes to optimization, and it looks like that is what is happening in this case. Although the loops perform essentially the same work, the x64 assembly code that the JITer creates is fundamentally different for the two, and I think that accounts for the speed difference you're seeing.
The assembly code for the two methods differs in the critical inner loop, which runs 1000*N times. This is what I think accounts for the speed difference.
Loop 1:
000007fe`97d50240 4d8bd1         mov     r10,r9
000007fe`97d50243 4983c128       add     r9,28h
000007fe`97d50247 4183c004       add     r8d,4
; Loop while j < 1000d
000007fe`97d5024b 4181f8e8030000 cmp     r8d,3E8h
000007fe`97d50252 7cec           jl      000007fe`97d50240
Loop 2:
; rax = ret
; ecx = j
; Add 10 to ret 4 times
000007fe`97d50292 48050a000000   add     rax,0Ah
000007fe`97d50298 48050a000000   add     rax,0Ah
000007fe`97d5029e 48050a000000   add     rax,0Ah
000007fe`97d502a4 48050a000000   add     rax,0Ah
000007fe`97d502aa 83c104         add     ecx,4   ; increment j by 4
; Loop while j < 1000d
000007fe`97d502ad 81f9e8030000   cmp     ecx,3E8h
000007fe`97d502b3 7cdd           jl      000007fe`97d50292
You'll notice that the JIT unrolls the inner loop in both cases, but the code inside the loop differs greatly in the number of instructions executed. Loop 1 is optimized down to a single add of 40 (28h) per unrolled pass of four iterations, whereas Loop 2 performs four separate adds of 10.
My (wild) guess is that the JITer can better optimize the variable p because it is defined in the inner scope of the first loop. Since it can detect that p is never used outside of that loop and is truly temporary, it can apply different optimizations. In the second loop, you're acting on a variable that is defined and used outside the scope of both loops, and the optimization rules used in the x64 JIT don't recognize it as code that is eligible for the same optimizations.
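To make the difference concrete, here is my reconstruction (not actual JIT output) of what the unrolled inner loop of Loop effectively computes, based on the disassembly above:

// Four source iterations of p += 10 collapsed into one add of 40
// (the "add r9,28h" in the disassembly above).
public long InnerLoopAsJitted()
{
    long p = 0;
    for (int j = 0; j < 1000; j += 4)
    {
        p += 40;
    }
    return p;
}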
I am not seeing any appreciable difference in performance. Using this LinqPad script (and including those two methods of yours):
void Main()
{
    // Warm up the VM
    Loop(10);
    Loop2(10);

    var stopwatch = Stopwatch.StartNew();
    Loop(10 * 1000 * 1000);
    stopwatch.Stop();
    stopwatch.Elapsed.Dump();

    stopwatch = Stopwatch.StartNew();
    Loop2(10 * 1000 * 1000);
    stopwatch.Stop();
    stopwatch.Elapsed.Dump();
}
Prints out (in LinqPad):
00:00:22.7749976
00:00:22.6971114
When reversing the order of the Loop / Loop2 calls, the results are similar:
00:00:22.7572688
00:00:22.6758102
This seems to indicate that the performance is the same. Perhaps you didn't warm up the VM?
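As an aside, a library like BenchmarkDotNet takes care of warm-up, multiple iterations, and statistical reporting automatically; a minimal sketch of the same comparison (the fixed TestSize is my own choice, and the method bodies are copied from the question):

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class LoopBenchmarks
{
    private const long TestSize = 100000; // hypothetical size for quick runs

    [Benchmark]
    public long Loop()
    {
        long ret = 0;
        for (long i = 0; i < TestSize; i++)
        {
            long p = 0;
            for (int j = 0; j < 1000; j++) p += 10;
            ret += p;
        }
        return ret;
    }

    [Benchmark]
    public long Loop2()
    {
        long ret = 0;
        for (long i = 0; i < TestSize; i++)
            for (int j = 0; j < 1000; j++) ret += 10;
        return ret;
    }
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<LoopBenchmarks>();
}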
Loop should be faster than Loop2. The only explanation that comes to my mind is that compiler optimization kicks in and reduces

long p = 0;
for (int j = 0; j < 1000; j++)
{
    p++;
}

to something like long p = 1000;. Checking the generated assembler code would bring clarity.
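The question's inner loop actually does p += 10, so the folded constant would be 10000 per outer iteration. A minimal sketch of the reduced method under that hypothesis (my illustration, not verified against the actual JIT output):

public long LoopFolded(long testSize)
{
    long ret = 0;
    for (long i = 0; i < testSize; i++)
    {
        // Hypothesized: 1000 iterations of p += 10 folded into one constant add.
        ret += 10000;
    }
    return ret;
}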