Why is the performance of this custom Vector2 struct so much worse than this custom Vector4?

While benchmarking some custom vector types, I discovered that, to my surprise, my Vector2 type is much slower for many basic operations when read from an array than my Vector4 type (and Vector3), despite the code itself having fewer operations, fields, and variables. Here is a greatly simplified example that demonstrates this:

using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

namespace VectorTest
{
    [StructLayout(LayoutKind.Sequential, Pack = 4)]
    public struct TestStruct4
    {
        public float X, Y, Z, W;

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public TestStruct4(float x, float y, float z, float w)
        {
            X = x;
            Y = y;
            Z = z;
            W = w;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static TestStruct4 operator +(in TestStruct4 a, in TestStruct4 b)
        {
            return new TestStruct4(
                a.X + b.X,
                a.Y + b.Y,
                a.Z + b.Z,
                a.W + b.W);
        }
    }

    [StructLayout(LayoutKind.Sequential, Pack = 4)]
    public struct TestStruct2
    {
        public float X, Y;

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public TestStruct2(float x, float y)
        {
            X = x;
            Y = y;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static TestStruct2 operator +(in TestStruct2 a, in TestStruct2 b)
        {
            return new TestStruct2(
                a.X + b.X,
                a.Y + b.Y);
        }
    }

    public class Program
    {
        private const int COUNT = 10000;
        private static readonly TestStruct4[] s_arr4 = new TestStruct4[COUNT];
        private static readonly TestStruct2[] s_arr2 = new TestStruct2[COUNT];

        static unsafe void Main()
        {
            for(int i = 0; i < s_arr4.Length; i++)
                s_arr4[i] = new TestStruct4(i, i * 2, i * 3, i * 4);
            for(int i = 0; i < s_arr2.Length; i++)
                s_arr2[i] = new TestStruct2(i, i * 2);

            BenchmarkRunner.Run<Program>();
        }

        [Benchmark]
        public TestStruct4 BenchmarkTestStruct4()
        {
            TestStruct4 ret = default;
            for (int i = 0; i < COUNT; i++)
                ret += s_arr4[i];
            return ret;
        }

        [Benchmark]
        public TestStruct2 BenchmarkTestStruct2()
        {
            TestStruct2 ret = default;
            for (int i = 0; i < COUNT; i++)
                ret += s_arr2[i];
            return ret;
        }
    }
}

Running this benchmark results in:

| Method               |      Mean |     Error |    StdDev |
|--------------------- |----------:|----------:|----------:|
| BenchmarkTestStruct4 |  9.863 us | 0.0706 us | 0.0626 us |
| BenchmarkTestStruct2 | 22.412 us | 0.3100 us | 0.2899 us |

As you can see, TestStruct2 is more than twice as slow as TestStruct4 (at least on my computer). Given that TestStruct2 is essentially identical to TestStruct4 except that it has fewer fields and has to do fewer adds, I would have expected it to, at worst, be the same speed as TestStruct4, but actually it's slower. Can anyone explain why this is?

Further experimentation has revealed that, if I add another float or two of padding to TestStruct2 (and use Unsafe.SkipInit to avoid the cost of initializing them), the performance improves to match that of TestStruct4. So I'm guessing there is some sort of alignment issue going on with TestStruct2, but I don't understand what specifically it might be. Pasting the code into SharpLab does not reveal any obvious differences in the ASM (though I might just not understand the ASM well enough to spot something).
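
For illustration, the padded variant I tried looked roughly like this (the TestStruct2Padded name and the _pad0/_pad1 fields are placeholder names for this sketch, not the exact code I benchmarked):

    [StructLayout(LayoutKind.Sequential, Pack = 4)]
    public struct TestStruct2Padded
    {
        public float X, Y;

        // Padding fields (never read) that bring the struct up to 16 bytes.
        private float _pad0, _pad1;

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public TestStruct2Padded(float x, float y)
        {
            // Satisfy definite assignment without paying for zero-initializing the padding.
            Unsafe.SkipInit(out _pad0);
            Unsafe.SkipInit(out _pad1);
            X = x;
            Y = y;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static TestStruct2Padded operator +(in TestStruct2Padded a, in TestStruct2Padded b)
        {
            return new TestStruct2Padded(
                a.X + b.X,
                a.Y + b.Y);
        }
    }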

EDIT: This is running on .NET 5 on Windows 10 64-bit.

(Note: I'm NOT interested in having a discussion about whether it is prudent to write one's own vector types when there are plenty of existing types already. I have my reasons for doing so, and they are beside the point of this question, which I ask out of academic curiosity to understand why there is such a huge performance difference.)

EDIT: As requested, here are the byte layouts of the two structs:

Type layout for 'TestStruct4'
Size: 16 bytes. Paddings: 0 bytes (%0 of empty space)
|===========================|
|   0-3: Single X (4 bytes) |
|---------------------------|
|   4-7: Single Y (4 bytes) |
|---------------------------|
|  8-11: Single Z (4 bytes) |
|---------------------------|
| 12-15: Single W (4 bytes) |
|===========================|


Type layout for 'TestStruct2'
Size: 8 bytes. Paddings: 0 bytes (%0 of empty space)
|===========================|
|   0-3: Single X (4 bytes) |
|---------------------------|
|   4-7: Single Y (4 bytes) |
|===========================|
asked Aug 10 '21 by Walt D

1 Answer

In the case of TestStruct4, the assembly generated for the '+' operator overload keeps the accumulated value in XMM registers, so the addition instructions look like this:

00007FFF72084077  vaddss      xmm0,xmm0,dword ptr [rdx]      ; accumulate X in xmm0
00007FFF7208407B  vaddss      xmm1,xmm1,dword ptr [rdx+4]    ; accumulate Y in xmm1
00007FFF72084080  vaddss      xmm2,xmm2,dword ptr [rdx+8]    ; accumulate Z in xmm2
00007FFF72084085  vaddss      xmm3,xmm3,dword ptr [rdx+0Ch]  ; accumulate W in xmm3

Nice and tidy. Now here is what gets generated for TestStruct2:

00007FFF6FE2B3EA  vmovss      xmm0,dword ptr [rsp+20h]       ; reload accumulated X from the stack
00007FFF6FE2B3F0  vaddss      xmm0,xmm0,dword ptr [rdx]      ; add the element's X
00007FFF6FE2B3F4  vmovss      xmm1,dword ptr [rsp+24h]       ; reload accumulated Y from the stack
00007FFF6FE2B3FA  vaddss      xmm1,xmm1,dword ptr [rdx+4]    ; add the element's Y
00007FFF6FE2B3FF  vmovss      dword ptr [rsp+20h],xmm0       ; spill X back to the stack
00007FFF6FE2B405  vmovss      dword ptr [rsp+24h],xmm1       ; spill Y back to the stack

Here the assembly for the '+' operator overload does not keep the running value in XMM registers but in memory, and that adds overhead: at the start of each iteration it loads the current value from memory into XMM registers, and at the end it writes the updated value back to memory.

It is not entirely clear why this happens, but it looks very much like the JIT failing to optimize this case properly. To work around this particular issue you could change the type of the fields from float to double; the code then gets optimized, and performance-wise it becomes essentially the same as TestStruct4. Or, if changing the field type is not an option, the solution is, as you mentioned, to add a dummy padding field.
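
For example, such a double-based variant could look roughly like this (TestStruct2D is just a placeholder name for this sketch; note that the struct grows to 16 bytes):

    [StructLayout(LayoutKind.Sequential)]
    public struct TestStruct2D
    {
        public double X, Y;   // 8-byte fields, so the struct occupies 16 bytes

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public TestStruct2D(double x, double y)
        {
            X = x;
            Y = y;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static TestStruct2D operator +(in TestStruct2D a, in TestStruct2D b)
        {
            return new TestStruct2D(
                a.X + b.X,
                a.Y + b.Y);
        }
    }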

answered Sep 20 '22 by Dennis Reshetnyak