I already knew that setting a field is much slower than setting a local variable, but it also appears that setting a field with a local variable is much slower than setting a local variable with a field. Why is this? In either case the address of the field is used.
public class Test
{
    public int A = 0;
    public int B = 4;

    public void Method1() // Set local with field
    {
        int a = A;
        for (int i = 0; i < 100; i++)
        {
            a += B;
        }
        A = a;
    }

    public void Method2() // Set field with local
    {
        int b = B;
        for (int i = 0; i < 100; i++)
        {
            A += b;
        }
    }
}
The benchmark results with 10e+6 iterations are:
Method1: 28.1321 ms
Method2: 162.4528 ms
Running this on my machine, I get similar time differences. Looking at the JITted code for 10M iterations makes it clear why (Method A and Method B below are Method1 and Method2 from the question):
Method A:
mov r8,rcx
; "A" is loaded into eax
mov eax,dword ptr [r8+8]
xor edx,edx
; "B" is loaded into ecx
mov ecx,dword ptr [r8+0Ch]
nop dword ptr [rax]
loop_start:
; Partially unrolled loop, all additions done in registers
add eax,ecx
add eax,ecx
add eax,ecx
add eax,ecx
add edx,4
cmp edx,989680h
jl loop_start
; Store the sum in eax back to "A"
mov dword ptr [r8+8],eax
ret
And Method B:
; "B" is loaded into edx
mov edx,dword ptr [rcx+0Ch]
xor r8d,r8d
nop word ptr [rax+rax]
loop_start:
; Partially unrolled loop, but each iteration requires reading "A" from memory
; adding "B" to it, and then writing the new "A" back to memory.
mov eax,dword ptr [rcx+8]
add eax,edx
mov dword ptr [rcx+8],eax
mov eax,dword ptr [rcx+8]
add eax,edx
mov dword ptr [rcx+8],eax
mov eax,dword ptr [rcx+8]
add eax,edx
mov dword ptr [rcx+8],eax
mov eax,dword ptr [rcx+8]
add eax,edx
mov dword ptr [rcx+8],eax
add r8d,4
cmp r8d,989680h
jl loop_start
rep ret
As you can see from the assembly, Method A is going to be significantly faster, since the values of A and B are both kept in registers and all of the additions occur there with no intermediate writes to memory. Method B, on the other hand, incurs a load from and a store to "A" in memory on every single iteration.
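If the per-iteration load and store matter in practice, one option is to do by hand what the JIT did not do here: keep the running sum in a local and write the field once after the loop. The following is a minimal sketch of that idea, not code from the question. The class name Counter and the method names AddInField and AddInLocal are made up for illustration, and the rewrite is only equivalent when no other thread needs to see intermediate values of A and the loop body cannot throw.

```csharp
using System;

// Hypothetical class for illustration; mirrors the fields of Test in the question.
public class Counter
{
    public int A = 0;
    public int B = 4;

    // Field updated on every iteration, as in Method2 from the question.
    public void AddInField(int iterations)
    {
        int b = B;
        for (int i = 0; i < iterations; i++)
        {
            A += b;
        }
    }

    // Same arithmetic, but the running sum lives in a local and is
    // stored back to the field once, after the loop.
    public void AddInLocal(int iterations)
    {
        int a = A;
        int b = B;
        for (int i = 0; i < iterations; i++)
        {
            a += b;
        }
        A = a;
    }
}

public static class Program
{
    public static void Main()
    {
        var viaField = new Counter();
        var viaLocal = new Counter();

        viaField.AddInField(100);
        viaLocal.AddInLocal(100);

        // Both variants end with the same value in A (100 * 4 = 400).
        Console.WriteLine(viaField.A == viaLocal.A); // prints "True"
    }
}
```

Whether the second form is actually faster depends on the runtime and JIT version, so it is worth measuring rather than assuming.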
In case 1, a is clearly stored in a register. Anything else would be a horrible compilation result.
Probably, the .NET JIT is not willing or able to convert the stores to A into register stores in case 2.
I doubt this is forced by the .NET memory model, because other threads can never tell the difference between your two methods if they only ever observe A to be 0 or the sum. They cannot disprove the theory that the optimization never happened. That makes it allowed under the semantics of the .NET abstract machine.
It is not surprising to see the .NET JIT perform few optimizations. This is well known to followers of the performance tag on Stack Overflow.
I know from experience that the JIT is much more likely to cache memory loads in registers. That's why case 1 (apparently) does not access B on each iteration.
Register computations are cheaper than memory accesses. This is true even if the memory in question is in the CPU L1 cache (as is the case here).
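To check the register-versus-memory difference on your own machine, a rough timing harness along these lines can be used. This is only a sketch: TimingSketch and MeasureMs are hypothetical names, the run count of 100,000 is an arbitrary choice, and a single warm-up call may not be enough to reach fully optimized code on runtimes with tiered compilation. A dedicated benchmarking library will give more trustworthy numbers.

```csharp
using System;
using System.Diagnostics;

// Copy of the class from the question so the sketch is self-contained.
public class Test
{
    public int A = 0;
    public int B = 4;

    public void Method1() // Set local with field
    {
        int a = A;
        for (int i = 0; i < 100; i++) { a += B; }
        A = a;
    }

    public void Method2() // Set field with local
    {
        int b = B;
        for (int i = 0; i < 100; i++) { A += b; }
    }
}

public static class TimingSketch
{
    // Hypothetical helper: calls the action once as a warm-up so it gets
    // JIT-compiled, then times `runs` further calls with a Stopwatch.
    public static double MeasureMs(Action action, int runs)
    {
        action(); // warm-up call, not timed
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++)
        {
            action();
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }

    public static void Main()
    {
        var test = new Test();
        const int runs = 100_000; // arbitrary; small enough that A does not overflow

        double ms1 = MeasureMs(test.Method1, runs);
        double ms2 = MeasureMs(test.Method2, runs);

        Console.WriteLine($"Method1: {ms1} ms");
        Console.WriteLine($"Method2: {ms2} ms");
    }
}
```

Run it in a Release build without a debugger attached, since a debug build or an attached debugger can disable the optimizations being compared.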
I thought only locals were eligible for CPU caching?
This cannot be so because the CPU does not even know what a local is. All addresses look the same.