I'm processing lots of data in a 3D grid so I wanted to implement a simple iterator instead of three nested loops. However, I encountered a performance problem: first, I implemented a simple loop using only int x, y and z variables. Then I implemented a Vector3I structure and used that - and the calculation time doubled. Now I'm struggling with the question - why is that? What did I do wrong?
Example for reproduction:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
public struct Vector2I
{
public int X;
public int Y;
public int Z;
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public Vector2I(int x, int y, int z)
{
this.X = x;
this.Y = y;
this.Z = z;
}
}
public class IterationTests
{
private readonly int _countX;
private readonly int _countY;
private readonly int _countZ;
private Vector2I _Vector = new Vector2I(0, 0, 0);
public IterationTests()
{
_countX = 64;
_countY = 64;
_countZ = 64;
}
[Benchmark]
public void NestedLoops()
{
int countX = _countX;
int countY = _countY;
int countZ = _countZ;
int result = 0;
for (int x = 0; x < countX; ++x)
{
for (int y = 0; y < countY; ++y)
{
for (int z = 0; z < countZ; ++z)
{
result += ((x ^ y) ^ (~z));
}
}
}
}
[Benchmark]
public void IteratedVariables()
{
int countX = _countX;
int countY = _countY;
int countZ = _countZ;
int result = 0;
int x = 0, y = 0, z = 0;
while (true)
{
result += ((x ^ y) ^ (~z));
++z;
if (z >= countZ)
{
z = 0;
++y;
if (y >= countY)
{
y = 0;
++x;
if (x >= countX)
{
break;
}
}
}
}
}
[Benchmark]
public void IteratedVector()
{
int countX = _countX;
int countY = _countY;
int countZ = _countZ;
int result = 0;
Vector2I iter = new Vector2I(0, 0, 0);
while (true)
{
result += ((iter.X ^ iter.Y) ^ (~iter.Z));
++iter.Z;
if (iter.Z >= countZ)
{
iter.Z = 0;
++iter.Y;
if (iter.Y >= countY)
{
iter.Y = 0;
++iter.X;
if (iter.X >= countX)
{
break;
}
}
}
}
}
[Benchmark]
public void IteratedVectorAvoidNew()
{
int countX = _countX;
int countY = _countY;
int countZ = _countZ;
int result = 0;
Vector2I iter = _Vector;
iter.X = 0;
iter.Y = 0;
iter.Z = 0;
while (true)
{
result += ((iter.X ^ iter.Y) ^ (~iter.Z));
++iter.Z;
if (iter.Z >= countZ)
{
iter.Z = 0;
++iter.Y;
if (iter.Y >= countY)
{
iter.Y = 0;
++iter.X;
if (iter.X >= countX)
{
break;
}
}
}
}
}
}
public static class Program
{
public static void Main(string[] args)
{
BenchmarkRunner.Run<IterationTests>();
}
}
What I measured:
Method | Mean | Error | StdDev |
----------------------- |---------:|----------:|----------:|
NestedLoops | 333.9 us | 4.6837 us | 4.3811 us |
IteratedVariables | 291.0 us | 0.8792 us | 0.6864 us |
IteratedVector | 702.1 us | 4.8590 us | 4.3073 us |
IteratedVectorAvoidNew | 725.8 us | 6.4850 us | 6.0661 us |
Note: the 'IteratedVectorAvoidNew' is there due to discussion that the problem might lie in the new
operator of Vector3I - originally, I used a custom iteration loop and measured with a stopwatch.
Additionally, a benchmark of when I iterate over a 256×256×256 area:
Method | Mean | Error | StdDev |
----------------------- |---------:|----------:|----------:|
NestedLoops | 18.67 ms | 0.0504 ms | 0.0446 ms |
IteratedVariables | 18.80 ms | 0.2006 ms | 0.1877 ms |
IteratedVector | 43.66 ms | 0.4525 ms | 0.4232 ms |
IteratedVectorAvoidNew | 43.36 ms | 0.5316 ms | 0.4973 ms |
My environment:
Notes:
My current task is to rewrite existing code to a) support more features, b) be faster. Also I'm working on lots of data - this is the current bottleneck of the whole application so no, it's not a premature optimization.
Rewriting nested loops into one - I'm not trying to optimize there. I just need to write such iterations many times, so simply wanted to simplify the code, nothing more. But because it's a performance-critical part of the code, I'm measuring such changes in design. Now, when I see that simply by storing three variables into a struct I double the processing time... I'm quite scared of using structs like that...
This relates to the difference between a memory access and a register access.
TL;DR:
With raw variables everything can be placed into registers, whereas with a struct everything has to be accessed from the stack, which is a memory access. Accessing a register is significantly faster than accessing memory.
Now, onto the full explanation:
C# is JIT compiled at launch (this is slightly different from the JVM, but that isn't important right now), because of this we can see the actual assembly generated (check here for how to view it).
For this I am only comparing IteratedVariables
and IteratedVector
because you're going to get the general gist with just these. First we have IteratedVariables
:
; int countX = 64;
in al, dx
push edi
push esi
push ebx
; int result = 0;
xor ebx, ebx
; int x = 0, y = 0, z = 0;
xor edi, edi
; int x = 0, y = 0, z = 0;
xor ecx, ecx
xor esi, esi
; while(true) {
; result += ((x ^ y) ^ (~z));
LOOP:
mov eax, edi
xor eax, ecx
mov edx, esi
not edx
xor eax, edx
add ebx, eax
; ++z;
inc esi
; if(z >= countZ)
cmp esi, 40h
jl LOOP
; {
; z = 0;
xor esi, esi
; ++y;
inc ecx
; if(y >= countY)
cmp ecx, 40h
jl LOOP
; {
; y = 0;
xor ecx, ecx
; ++x;
inc edi
; if(x >= countX)
cmp edi, 40h
jl LOOP
; {
; break;
; } } } }
; return result;
mov eax, ebx
pop ebx
pop esi
pop edi
pop ebp
ret
I've done a little work to clean up the code, all of the comments (lines marked with semicolons (;
)) are from the actual C# code (these were generated for me), I've cleaned them up a bit for brevity. The primary thing you should notice here is that everything is accessing a register, there is no raw memory access (A raw memory access can be somewhat identified by []
around a register name).
In the second example (IteratedVector
) we will see a slightly different code piece:
; int countX = 64;
push ebp
mov ebp, esp
sub esp, 0Ch
xor eax, eax
mov dword ptr [ebp-0Ch], eax
mov dword ptr [ebp-8], eax
mov dword ptr [ebp-4], eax
; int result = 0;
xor ecx,ecx
; Vector3i iter = new Vector3i(0, 0, 0);
mov dword ptr [ebp-0Ch], ecx
mov dword ptr [ebp-8], ecx
mov dword ptr [ebp-4], ecx
; while(true) {
; result += ((iter.X ^ iter.Y) ^ (~iter.Z));
LOOP:
mov eax, dword ptr [ebp-0Ch]
xor eax, dword ptr [ebp-8]
mov edx, dword ptr [ebp-4]
not edx
xor eax, edx
add ecx, eax
; ++iter.Z;
lea eax, [ebp-4]
inc dword ptr [eax]
; if(iter.Z >= countZ)
cmp dword ptr [ebp-4], 40h
jl LOOP
; {
; iter.Z = 0;
xor edx, edx
mov dword ptr [ebp-4], edx
; ++iter.Y;
lea eax, [ebp-8]
inc dword ptr [eax]
; if(iter.Y >= countY)
cmp dword ptr [ebp-8], 40h
jl LOOP
; {
; iter.Y = 0;
xor edx, edx
mov dword ptr [ebp-8], edx
; ++iter.X;
lea eax, [ebp-0Ch]
inc dword ptr [eax]
; if(iter.X >= countX)
cmp dword ptr [ebp-0Ch], 40h
jl LOOP
; {
; break;
; } } } }
; return result;
mov eax, ecx
mov esp, ebp
; {
; break;
; } } } }
; return result;
pop ebp
ret
Here you will distinctly notice lot's of raw memory accesses, they are identified by the square brackets ([]
), they also have the tag dword ptr
, don't worry too much about what that means, but just think of it as Memory Access
. You will notice that the code here is riddled with them. They are everywhere that a value access from the struct occurs.
This is the reason why the struct code is so much slower, registers are right next to the CPU (literally inside it), but memory is far away, even if it is in the CPU cache it will still be significantly slower to access then registers.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With