While benchmarking some custom vector types, I discovered that, to my surprise, my Vector2 type is much slower for many basic operations when read from an array than my Vector4 type (and Vector3), despite the code itself having fewer operations, fields, and variables. Here is a greatly simplified example that demonstrates this:
    using System.Runtime.CompilerServices;
    using System.Runtime.InteropServices;
    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    namespace VectorTest
    {
        [StructLayout(LayoutKind.Sequential, Pack = 4)]
        public struct TestStruct4
        {
            public float X, Y, Z, W;

            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            public TestStruct4(float x, float y, float z, float w)
            {
                X = x;
                Y = y;
                Z = z;
                W = w;
            }

            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            public static TestStruct4 operator +(in TestStruct4 a, in TestStruct4 b)
            {
                return new TestStruct4(
                    a.X + b.X,
                    a.Y + b.Y,
                    a.Z + b.Z,
                    a.W + b.W);
            }
        }

        [StructLayout(LayoutKind.Sequential, Pack = 4)]
        public struct TestStruct2
        {
            public float X, Y;

            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            public TestStruct2(float x, float y)
            {
                X = x;
                Y = y;
            }

            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            public static TestStruct2 operator +(in TestStruct2 a, in TestStruct2 b)
            {
                return new TestStruct2(
                    a.X + b.X,
                    a.Y + b.Y);
            }
        }

        public class Program
        {
            private const int COUNT = 10000;
            private static readonly TestStruct4[] s_arr4 = new TestStruct4[COUNT];
            private static readonly TestStruct2[] s_arr2 = new TestStruct2[COUNT];

            static unsafe void Main()
            {
                for (int i = 0; i < s_arr4.Length; i++)
                    s_arr4[i] = new TestStruct4(i, i * 2, i * 3, i * 4);
                for (int i = 0; i < s_arr2.Length; i++)
                    s_arr2[i] = new TestStruct2(i, i * 2);
                BenchmarkRunner.Run<Program>();
            }

            [Benchmark]
            public TestStruct4 BenchmarkTestStruct4()
            {
                TestStruct4 ret = default;
                for (int i = 0; i < COUNT; i++)
                    ret += s_arr4[i];
                return ret;
            }

            [Benchmark]
            public TestStruct2 BenchmarkTestStruct2()
            {
                TestStruct2 ret = default;
                for (int i = 0; i < COUNT; i++)
                    ret += s_arr2[i];
                return ret;
            }
        }
    }
Running this benchmark results in:
Method | Mean | Error | StdDev |
---|---|---|---|
BenchmarkTestStruct4 | 9.863 us | 0.0706 us | 0.0626 us |
BenchmarkTestStruct2 | 22.412 us | 0.3100 us | 0.2899 us |
As you can see, TestStruct2 is more than twice as slow as TestStruct4 (at least on my computer). Given that TestStruct2 is essentially identical to TestStruct4 except that it has fewer fields and has to do fewer adds, I would have expected it to, at worst, be the same speed as TestStruct4, but actually it's slower. Can anyone explain why this is?
Further experimentation has revealed that, if I add another float or two of padding to TestStruct2 (and use Unsafe.SkipInit to avoid the cost of initializing those), then the performance improves to match that of TestStruct4. So I'm guessing that there's some sort of alignment issue going on with TestStruct2, but I don't understand what specifically that might be. Pasting the code into SharpLab does not reveal any obvious differences in the ASM (though I might just not understand the ASM well enough to spot something).
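For readers who want to reproduce the padding experiment described above, here is a minimal sketch of what such a padded struct might look like. The type name `TestStruct2Padded` and the field names `_pad0` and `_pad1` are made up for illustration and are not from the original benchmark; whether this actually matches TestStruct4's speed will depend on the runtime and JIT version.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Hypothetical padded variant of TestStruct2: two unused floats bring the
// size to 16 bytes, the same as TestStruct4.
[StructLayout(LayoutKind.Sequential, Pack = 4)]
public struct TestStruct2Padded
{
    public float X, Y;

#pragma warning disable CS0169 // the padding fields are intentionally never used
    private float _pad0, _pad1;
#pragma warning restore CS0169

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public TestStruct2Padded(float x, float y)
    {
        // Satisfies definite assignment without writing the padding fields,
        // so their contents are left unspecified.
        Unsafe.SkipInit(out this);
        X = x;
        Y = y;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static TestStruct2Padded operator +(in TestStruct2Padded a, in TestStruct2Padded b)
    {
        return new TestStruct2Padded(a.X + b.X, a.Y + b.Y);
    }
}
```

The trade-off is memory: an array of these takes twice the space of an array of TestStruct2, so any gain in the loop should be weighed against the extra cache traffic.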
EDIT: This is running on .NET 5 on Windows 10 64-bit.
(Note: I'm NOT interested in having a discussion about whether it is prudent to write one's own vector types when there are plenty of existing types already. I have my reasons for doing so and they are beside the point of this question, which I ask out of an academic curiosity to understand why there is such a huge performance difference.)
EDIT: As requested, here are the byte layouts of the two structs:
Type layout for 'TestStruct4'
Size: 16 bytes. Paddings: 0 bytes (%0 of empty space)
|===========================|
| 0-3: Single X (4 bytes) |
|---------------------------|
| 4-7: Single Y (4 bytes) |
|---------------------------|
| 8-11: Single Z (4 bytes) |
|---------------------------|
| 12-15: Single W (4 bytes) |
|===========================|
Type layout for 'TestStruct2'
Size: 8 bytes. Paddings: 0 bytes (%0 of empty space)
|===========================|
| 0-3: Single X (4 bytes) |
|---------------------------|
| 4-7: Single Y (4 bytes) |
|===========================|
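The sizes and offsets shown above can also be checked from code without any extra tooling. The following is only a sketch: the `LayoutCheck` class and the `LayoutProbe2`/`LayoutProbe4` structs are hypothetical stand-ins that mirror TestStruct2 and TestStruct4, and `Marshal.OffsetOf` reports the unmanaged layout, which should coincide with the managed one for blittable sequential structs like these.

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical stand-ins with the same layout attributes as the benchmark structs.
[StructLayout(LayoutKind.Sequential, Pack = 4)]
public struct LayoutProbe2 { public float X, Y; }

[StructLayout(LayoutKind.Sequential, Pack = 4)]
public struct LayoutProbe4 { public float X, Y, Z, W; }

public static class LayoutCheck
{
    // Builds a one-line summary: the managed size plus the offset of each instance field.
    public static string Describe<T>() where T : struct
    {
        var sb = new StringBuilder();
        sb.Append(typeof(T).Name)
          .Append(": size ")
          .Append(Unsafe.SizeOf<T>())
          .Append(" bytes;");

        const BindingFlags flags =
            BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic;

        foreach (FieldInfo field in typeof(T).GetFields(flags))
        {
            int offset = Marshal.OffsetOf<T>(field.Name).ToInt32();
            sb.Append(' ').Append(field.Name).Append('@').Append(offset);
        }

        return sb.ToString();
    }
}
```

Calling `Console.WriteLine(LayoutCheck.Describe<LayoutProbe2>())` prints the size and per-field offsets, which can be compared against the layouts listed above.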
For the struct TestStruct4, the assembly generated for the '+' operator overload keeps the running sum in XMM registers, so the addition instructions look like this:
00007FFF72084077 vaddss xmm0,xmm0,dword ptr [rdx]
00007FFF7208407B vaddss xmm1,xmm1,dword ptr [rdx+4]
00007FFF72084080 vaddss xmm2,xmm2,dword ptr [rdx+8]
00007FFF72084085 vaddss xmm3,xmm3,dword ptr [rdx+0Ch]
Nice and tidy. Now here is what gets generated for TestStruct2:
00007FFF6FE2B3EA vmovss xmm0,dword ptr [rsp+20h]
00007FFF6FE2B3F0 vaddss xmm0,xmm0,dword ptr [rdx]
00007FFF6FE2B3F4 vmovss xmm1,dword ptr [rsp+24h]
00007FFF6FE2B3FA vaddss xmm1,xmm1,dword ptr [rdx+4]
00007FFF6FE2B3FF vmovss dword ptr [rsp+20h],xmm0
00007FFF6FE2B405 vmovss dword ptr [rsp+24h],xmm1
Here the assembly generated for the '+' operator overload does not keep the values in XMM registers but in memory, on the stack at [rsp+20h] and [rsp+24h]. That adds overhead to every addition: each value is first loaded from memory into an XMM register, and after the add the modified value is stored back to memory.
It is not really clear why this happens, but it looks a lot like the JIT compiler failing to optimize the code properly. To work around this particular issue you could change the type of the fields from float to double; the code then gets optimized and, performance-wise, it will be essentially the same. Or, if changing the type is not an option, the solution would be, as you mentioned, to add a dummy field.
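As a rough illustration of the float-to-double workaround, the struct might look like the sketch below. The name `TestStruct2D` is invented here, and the `Pack = 4` setting from the original is dropped so the doubles keep their natural alignment; that is a choice made for this sketch, not something from the benchmark. Whether the JIT then keeps the accumulator in registers should be confirmed by re-running the benchmark on the target runtime.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Hypothetical double-precision variant of TestStruct2 (16 bytes instead of 8).
[StructLayout(LayoutKind.Sequential)]
public struct TestStruct2D
{
    public double X, Y;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public TestStruct2D(double x, double y)
    {
        X = x;
        Y = y;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static TestStruct2D operator +(in TestStruct2D a, in TestStruct2D b)
    {
        return new TestStruct2D(a.X + b.X, a.Y + b.Y);
    }
}
```

Like the dummy-field approach, this doubles the memory footprint per element, and it also changes the arithmetic from single to double precision, so results may differ slightly from the float version.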