Why is using structure Vector3I instead of three ints much slower in C#?

Question

I'm processing lots of data in a 3D grid so I wanted to implement a simple iterator instead of three nested loops. However, I encountered a performance problem: first, I implemented a simple loop using only int x, y and z variables. Then I implemented a Vector3I structure and used that - and the calculation time doubled. Now I'm struggling with the question - why is that? What did I do wrong?

Example for reproduction:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

public struct Vector2I
{
    public int X;
    public int Y;
    public int Z;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public Vector2I(int x, int y, int z)
    {
        this.X = x;
        this.Y = y;
        this.Z = z;
    }
}

public class IterationTests
{
    private readonly int _countX;
    private readonly int _countY;
    private readonly int _countZ;
    private Vector2I _Vector = new Vector2I(0, 0, 0);


    public IterationTests()
    {
        _countX = 64;
        _countY = 64;
        _countZ = 64;
    }

    [Benchmark]
    public void NestedLoops()
    {
        int countX = _countX;
        int countY = _countY;
        int countZ = _countZ;

        int result = 0;

        for (int x = 0; x < countX; ++x)
        {
            for (int y = 0; y < countY; ++y)
            {
                for (int z = 0; z < countZ; ++z)
                {
                    result += ((x ^ y) ^ (~z));
                }
            }
        }
    }

    [Benchmark]
    public void IteratedVariables()
    {
        int countX = _countX;
        int countY = _countY;
        int countZ = _countZ;

        int result = 0;

        int x = 0, y = 0, z = 0;
        while (true)
        {
            result += ((x ^ y) ^ (~z));

            ++z;
            if (z >= countZ)
            {
                z = 0;
                ++y;

                if (y >= countY)
                {
                    y = 0;
                    ++x;

                    if (x >= countX)
                    {
                        break;
                    }
                }
            }
        }
    }

    [Benchmark]
    public void IteratedVector()
    {
        int countX = _countX;
        int countY = _countY;
        int countZ = _countZ;

        int result = 0;

        Vector2I iter = new Vector2I(0, 0, 0);
        while (true)
        {
            result += ((iter.X ^ iter.Y) ^ (~iter.Z));

            ++iter.Z;
            if (iter.Z >= countZ)
            {
                iter.Z = 0;
                ++iter.Y;

                if (iter.Y >= countY)
                {
                    iter.Y = 0;
                    ++iter.X;

                    if (iter.X >= countX)
                    {
                        break;
                    }
                }
            }
        }
    }

    [Benchmark]
    public void IteratedVectorAvoidNew()
    {
        int countX = _countX;
        int countY = _countY;
        int countZ = _countZ;

        int result = 0;

        Vector2I iter = _Vector;

        iter.X = 0;
        iter.Y = 0;
        iter.Z = 0;
        while (true)
        {
            result += ((iter.X ^ iter.Y) ^ (~iter.Z));

            ++iter.Z;
            if (iter.Z >= countZ)
            {
                iter.Z = 0;
                ++iter.Y;

                if (iter.Y >= countY)
                {
                    iter.Y = 0;
                    ++iter.X;

                    if (iter.X >= countX)
                    {
                        break;
                    }
                }
            }
        }
    }
}

public static class Program
{
    public static void Main(string[] args)
    {
        BenchmarkRunner.Run<IterationTests>();
    }
}

What I measured:

                 Method |     Mean |     Error |    StdDev |
----------------------- |---------:|----------:|----------:|
            NestedLoops | 333.9 us | 4.6837 us | 4.3811 us |
      IteratedVariables | 291.0 us | 0.8792 us | 0.6864 us |
         IteratedVector | 702.1 us | 4.8590 us | 4.3073 us |
 IteratedVectorAvoidNew | 725.8 us | 6.4850 us | 6.0661 us |

Note: the 'IteratedVectorAvoidNew' is there due to discussion that the problem might lie in the new operator of Vector3I - originally, I used a custom iteration loop and measured with a stopwatch.

Additionally, a benchmark of when I iterate over a 256×256×256 area:

                 Method |     Mean |     Error |    StdDev |
----------------------- |---------:|----------:|----------:|
            NestedLoops | 18.67 ms | 0.0504 ms | 0.0446 ms |
      IteratedVariables | 18.80 ms | 0.2006 ms | 0.1877 ms |
         IteratedVector | 43.66 ms | 0.4525 ms | 0.4232 ms |
 IteratedVectorAvoidNew | 43.36 ms | 0.5316 ms | 0.4973 ms |

My environment:

Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
Windows 10, 64 bit
Visual Studio 2017
Language: C#
Yes, I selected Release configuration

Notes:

My current task is to rewrite existing code to a) support more features, b) be faster. Also I'm working on lots of data - this is the current bottleneck of the whole application so no, it's not a premature optimization.

Rewriting nested loops into one - I'm not trying to optimize there. I just need to write such iterations many times, so simply wanted to simplify the code, nothing more. But because it's a performance-critical part of the code, I'm measuring such changes in design. Now, when I see that simply by storing three variables into a struct I double the processing time... I'm quite scared of using structs like that...

van dench · Accepted Answer

This relates to the difference between a memory access and a register access.

TL;DR:
With raw variables everything can be placed into registers, whereas with a struct everything has to be accessed from the stack, which is a memory access. Accessing a register is significantly faster than accessing memory.

Now, onto the full explanation:

C# is JIT compiled at launch (this is slightly different from the JVM, but that isn't important right now), because of this we can see the actual assembly generated (check here for how to view it).

For this I am only comparing IteratedVariables and IteratedVector because you're going to get the general gist with just these. First we have IteratedVariables:

                    ; int countX = 64;
in   al, dx  
push edi  
push esi  
push ebx  
                    ; int result = 0;
xor ebx, ebx  
                    ; int x = 0, y = 0, z = 0;
xor edi, edi  
                    ; int x = 0, y = 0, z = 0;
xor ecx, ecx  
xor esi, esi  
                    ; while(true) {
                    ;     result += ((x ^ y) ^ (~z));
LOOP:
    mov eax, edi  
    xor eax, ecx  
    mov edx, esi  
    not edx   
    xor eax, edx  
    add ebx, eax 
                    ; ++z;
    inc esi  
                    ; if(z >= countZ)
    cmp esi, 40h  
    jl  LOOP  
                    ; {
                    ;     z = 0;
    xor esi, esi  
                    ; ++y;
    inc ecx  
                    ; if(y >= countY)
    cmp ecx, 40h  
    jl  LOOP  
                    ; {
                    ;     y = 0;
    xor ecx, ecx  
                    ; ++x;
    inc edi  
                    ; if(x >= countX)
    cmp edi, 40h  
    jl  LOOP  
                    ; {
                    ;     break;
                    ; } } } }
                    ; return result;
mov eax, ebx  
pop ebx  
pop esi  
pop edi  
pop ebp  
ret

I've done a little work to clean up the code, all of the comments (lines marked with semicolons (;)) are from the actual C# code (these were generated for me), I've cleaned them up a bit for brevity. The primary thing you should notice here is that everything is accessing a register, there is no raw memory access (A raw memory access can be somewhat identified by [] around a register name).

In the second example (IteratedVector) we will see a slightly different code piece:

                                    ; int countX = 64;
push ebp  
mov  ebp, esp  
sub  esp, 0Ch  
xor  eax, eax  
mov  dword ptr [ebp-0Ch], eax  
mov  dword ptr [ebp-8],   eax  
mov  dword ptr [ebp-4],   eax  
                                    ; int result = 0;
xor ecx,ecx  
                                    ; Vector3i iter = new Vector3i(0, 0, 0);
mov dword ptr [ebp-0Ch], ecx  
mov dword ptr [ebp-8],   ecx  
mov dword ptr [ebp-4],   ecx  
                                    ; while(true) {
                                    ;     result += ((iter.X ^ iter.Y) ^ (~iter.Z));
LOOP:
    mov eax, dword ptr [ebp-0Ch]  
    xor eax, dword ptr [ebp-8]  
    mov edx, dword ptr [ebp-4]  
    not edx  
    xor eax, edx  
    add ecx, eax  
                                    ; ++iter.Z;
    lea eax, [ebp-4]  
    inc dword ptr [eax]  
                                    ; if(iter.Z >= countZ)
    cmp dword ptr [ebp-4], 40h  
    jl  LOOP  
                                    ; {
                                    ;     iter.Z = 0;
    xor edx, edx  
    mov dword ptr [ebp-4], edx  
                                    ; ++iter.Y;
    lea eax, [ebp-8]  
    inc dword ptr [eax]  
                                    ; if(iter.Y >= countY)
    cmp dword ptr [ebp-8], 40h  
    jl  LOOP  
                                    ; {
                                    ;     iter.Y = 0;
    xor edx, edx  
    mov dword ptr [ebp-8], edx  
                                    ; ++iter.X;
    lea eax, [ebp-0Ch]  
    inc dword ptr [eax]  
                                    ; if(iter.X >= countX)
    cmp dword ptr [ebp-0Ch], 40h  
    jl  LOOP  
                                    ; {
                                    ;     break;
                                    ; } } } }
                                    ; return result;
mov eax, ecx  
mov esp, ebp  
                                    ;  {
                                    ;      break;
                                    ;  } } } }
                                    ;  return result;
pop ebp  
ret

Here you will distinctly notice lot's of raw memory accesses, they are identified by the square brackets ([]), they also have the tag dword ptr, don't worry too much about what that means, but just think of it as Memory Access. You will notice that the code here is riddled with them. They are everywhere that a value access from the struct occurs.

This is the reason why the struct code is so much slower, registers are right next to the CPU (literally inside it), but memory is far away, even if it is in the CPU cache it will still be significantly slower to access then registers.

Why is using structure Vector3I instead of three ints much slower in C#?

Tags:

performance

c#

x86

compiler-optimization

intel

micro-optimization

Tom

1 Answers

van dench

Recent Activity

Donate For Us

Why is using structure Vector3I instead of three ints much slower in C#?

Tags:

performance

c#

x86

compiler-optimization

intel

micro-optimization

Tom

1 Answers

van dench

Related questions

Recent Activity

Donate For Us