I have three cases to test the relative performance of classes, classes with inheritance, and structs. These are to be used in tight loops, so performance counts. Dot products are used in many algorithms in 2D and 3D geometry, and I have run the profiler on real code. The tests below are indicative of real-world performance problems I have seen.
The results for 100000000 iterations of the loop, each applying the dot product, are:
ControlA 208 ms (class with inheritance)
ControlB 201 ms (class with no inheritance)
ControlC 85 ms (struct)
The tests were run without the debugger attached and with optimizations turned on. My question is: what is it about classes in this case that causes them to be so slow?
I presumed the JIT would still be able to inline all the calls, class or struct, so in effect the results should be identical. Note that if I disable optimizations, my results are nearly identical:
ControlA 3239 ms
ControlB 3228 ms
ControlC 3213 ms
They are always within 20 ms of each other if the test is re-run.
using System;
using System.Diagnostics;

// Case A: class with inheritance (3D point derived from a 2D point).
public class PointControlA
{
    public double X { get; set; }
    public double Y { get; set; }

    public PointControlA(double x, double y)
    {
        X = x;
        Y = y;
    }
}

public class Point3ControlA : PointControlA
{
    public double Z { get; set; }

    public Point3ControlA(double x, double y, double z) : base(x, y)
    {
        Z = z;
    }

    public static double Dot(Point3ControlA a, Point3ControlA b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }
}

// Case B: class with no inheritance.
public class Point3ControlB
{
    public double X { get; set; }
    public double Y { get; set; }
    public double Z { get; set; }

    public Point3ControlB(double x, double y, double z)
    {
        X = x;
        Y = y;
        Z = z;
    }

    public static double Dot(Point3ControlB a, Point3ControlB b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }
}

// Case C: struct.
public struct Point3ControlC
{
    public double X { get; set; }
    public double Y { get; set; }
    public double Z { get; set; }

    public Point3ControlC(double x, double y, double z) : this()
    {
        X = x;
        Y = y;
        Z = z;
    }

    public static double Dot(Point3ControlC a, Point3ControlC b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }
}

public class Program
{
    public static void TestStructClass()
    {
        var vControlA = new Point3ControlA(11, 12, 13);
        var vControlB = new Point3ControlB(11, 12, 13);
        var vControlC = new Point3ControlC(11, 12, 13);
        var n = 10000000;
        double acc = 0;

        // Time case A: class with inheritance.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            acc += Point3ControlA.Dot(vControlA, vControlA);
        }
        Console.WriteLine("ControlA " + sw.ElapsedMilliseconds);

        // Time case B: class with no inheritance.
        acc = 0;
        sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            acc += Point3ControlB.Dot(vControlB, vControlB);
        }
        Console.WriteLine("ControlB " + sw.ElapsedMilliseconds);

        // Time case C: struct.
        acc = 0;
        sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            acc += Point3ControlC.Dot(vControlC, vControlC);
        }
        Console.WriteLine("ControlC " + sw.ElapsedMilliseconds);
    }

    public static void Main()
    {
        TestStructClass();
    }
}
This dotnet fiddle is proof of compilation only. It does not show the performance differences.
I am trying to explain to a vendor why their choice to use classes instead of structs for small numeric types is a bad idea. I now have the test case to prove it but I can't understand why.
NOTE : I have tried to set a breakpoint in the debugger with JIT optimizations turned on but the debugger will not break. Looking at the IL with JIT optimizations turned off doesn't tell me anything.
After the answer by @pkuderov I took his code and played with it. I found that if I forced inlining via
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static double Dot(Point3Class a)
    {
        return a.X * a.X + a.Y * a.Y + a.Z * a.Z;
    }
the difference between the struct and the class for the dot product vanished. It is not clear to me why the attribute is unnecessary in some setups but was needed in mine. However, I did not give up. There is still a performance problem with the vendor code, and I think the dot product is not the best example.
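For anyone wanting to reproduce the forced-inlining variant, here is a minimal self-contained sketch. `MethodImpl` lives in `System.Runtime.CompilerServices`. The `Point3Class` below is a hypothetical stand-in that I assume has the same shape as `Point3ControlB`; the real class is in @pkuderov's code and may differ. Note also that the attribute is a request to the JIT, not a guarantee.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for the Point3Class used in the answer's test code.
public class Point3Class
{
    public double X { get; set; }
    public double Y { get; set; }
    public double Z { get; set; }

    public Point3Class(double x, double y, double z)
    {
        X = x;
        Y = y;
        Z = z;
    }

    // Asks the JIT to inline this method where it can; a hint, not a guarantee.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static double Dot(Point3Class a)
    {
        return a.X * a.X + a.Y * a.Y + a.Z * a.Z;
    }
}

public static class InliningDemo
{
    public static void Main()
    {
        var p = new Point3Class(11, 12, 13);
        Console.WriteLine(Point3Class.Dot(p)); // 121 + 144 + 169 = 434
    }
}
```

Whether the call is actually inlined can only be confirmed by looking at the generated machine code for an optimized build.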
I modified @pkuderov's code to implement vector Add, which creates new instances of the structs and classes. The results are here:
https://gist.github.com/bradphelan/9b383c8e99edc38068fcc0dccc8a7b48
In the example I also modified the code to pick a pseudo-random vector from an array, to avoid the problem of the instances sticking in registers (I hope).
The results show that:
DotProduct performance is identical, or maybe faster, for classes.
Vector Add, and I assume anything that creates a new object, is slower for classes.
Add class/class 2777 ms
Add struct/struct 2457 ms
DotProd class/class 1909 ms
DotProd struct/struct 2108 ms
The full code and results are here if anybody wants to try it out.
For the vector add example, where an array of vectors is summed together, the struct version keeps the accumulator in 3 registers:
    var accStruct = new Point3Struct(0, 0, 0);
    for (int i = 0; i < n; i++)
        accStruct = Point3Struct.Add(accStruct, pointStruct[(i + 1) % m]);
The asm body is:
// load the next vector into a register
00007FFA3CA2240E vmovsd xmm3,qword ptr [rax]
00007FFA3CA22413 vmovsd xmm4,qword ptr [rax+8]
00007FFA3CA22419 vmovsd xmm5,qword ptr [rax+10h]
// Sum the accumulator (the accumulator stays in the registers )
00007FFA3CA2241F vaddsd xmm0,xmm0,xmm3
00007FFA3CA22424 vaddsd xmm1,xmm1,xmm4
00007FFA3CA22429 vaddsd xmm2,xmm2,xmm5
but the class-based version reads the accumulator from memory and writes it back on every iteration, which is inefficient:
    var accPC = new Point3Class(0, 0, 0);
    for (int i = 0; i < n; i++)
        accPC = Point3Class.Add(accPC, pointClass[(i + 1) % m]);
The asm body is:
// Read and add both accumulator X and Xnext from main memory
00007FFA3CA2224A vmovsd xmm0,qword ptr [r14+8]
00007FFA3CA22250 vmovaps xmm7,xmm0
00007FFA3CA22255 vaddsd xmm7,xmm7,mmword ptr [r12+8]
// Read and add both accumulator Y and Ynext from main memory
00007FFA3CA2225C vmovsd xmm0,qword ptr [r14+10h]
00007FFA3CA22262 vmovaps xmm8,xmm0
00007FFA3CA22267 vaddsd xmm8,xmm8,mmword ptr [r12+10h]
// Read and add both accumulator Z and Znext from main memory
00007FFA3CA2226E vmovsd xmm9,qword ptr [r14+18h]
00007FFA3CA22283 vmovaps xmm0,xmm9
00007FFA3CA22288 vaddsd xmm0,xmm0,mmword ptr [r12+18h]
// Move accumulator X, Y, Z back to main memory.
00007FFA3CA2228F vmovsd qword ptr [rax+8],xmm7
00007FFA3CA22295 vmovsd qword ptr [rax+10h],xmm8
00007FFA3CA2229B vmovsd qword ptr [rax+18h],xmm0
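The two loops above can be tried with a sketch like the following. The names `Point3Struct`, `Point3Class`, and `Add` come from the snippets above, but the type definitions here are my assumption of their shape; the real ones are in the linked gist and may differ. The helpers `SumStructs` and `SumClasses` are hypothetical, added only for illustration.

```csharp
using System;

// Assumed shape: a value type whose Add returns a new value.
public struct Point3Struct
{
    public readonly double X, Y, Z;

    public Point3Struct(double x, double y, double z) { X = x; Y = y; Z = z; }

    public static Point3Struct Add(Point3Struct a, Point3Struct b)
    {
        return new Point3Struct(a.X + b.X, a.Y + b.Y, a.Z + b.Z);
    }
}

// Assumed shape: a reference type whose Add allocates a new object per call.
public class Point3Class
{
    public readonly double X, Y, Z;

    public Point3Class(double x, double y, double z) { X = x; Y = y; Z = z; }

    public static Point3Class Add(Point3Class a, Point3Class b)
    {
        return new Point3Class(a.X + b.X, a.Y + b.Y, a.Z + b.Z);
    }
}

public static class AddDemo
{
    // Hypothetical helper: the struct loop from the text.
    public static Point3Struct SumStructs(Point3Struct[] points, int n)
    {
        int m = points.Length;
        var acc = new Point3Struct(0, 0, 0);
        for (int i = 0; i < n; i++)
            acc = Point3Struct.Add(acc, points[(i + 1) % m]);
        return acc;
    }

    // Hypothetical helper: the class loop from the text.
    public static Point3Class SumClasses(Point3Class[] points, int n)
    {
        int m = points.Length;
        var acc = new Point3Class(0, 0, 0);
        for (int i = 0; i < n; i++)
            acc = Point3Class.Add(acc, points[(i + 1) % m]);
        return acc;
    }

    public static void Main()
    {
        const int m = 4;
        var ps = new Point3Struct[m];
        var pc = new Point3Class[m];
        for (int i = 0; i < m; i++)
        {
            ps[i] = new Point3Struct(i, 2 * i, 3 * i);
            pc[i] = new Point3Class(i, 2 * i, 3 * i);
        }

        // With n = 8 and m = 4 every element is visited exactly twice.
        var s = SumStructs(ps, 8);
        var c = SumClasses(pc, 8);
        Console.WriteLine("{0} {1} {2}", s.X, s.Y, s.Z); // 12 24 36
        Console.WriteLine("{0} {1} {2}", c.X, c.Y, c.Z); // 12 24 36
    }
}
```

Both versions compute the same sums; the difference the asm shows is only in where the accumulator lives between iterations.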
Update
After spending some time thinking about the problem, I think I agree with @DavidHaim that memory jump overhead is not the issue here, because of caching.
I've also added more options to your tests (and removed the first one, with inheritance). So I have:
Dot(cl, cl) - initial method
Dot(cl) - which is "square product"
Dot(cl.X, cl.Y, cl.Z, cl.X, cl.Y, cl.Z) aka Dot(cl.xyz) - pass fields
Dot(st, st) - initial
Dot(st) - square product
Dot(st.X, st.Y, st.Z, st.X, st.Y, st.Z) aka Dot(st.xyz) - pass fields
Dot(st6) - wanted to check if size of struct matters
Dot(x, y, z, x, y, z) aka Dot(xyz) - just local const double variables
Result times are:
...
And I'm not really sure why I see these results.
Maybe for plain primitive types the compiler does more aggressive pass-by-register optimizations; maybe it's more sure of lifetime boundaries or constness and can then optimize more aggressively again. Maybe some kind of loop unrolling.
I think my expertise is just not enough :) But still, my results contradict your results.
You can find the full test code, with results on my machine and the generated IL code, here.
In C# classes are reference types and structs are value types. One major effect is that value types can be (and most of the time are!) allocated on the stack, while reference types are always allocated on the heap.
So every time you get access to the inner state of a reference type variable you need to dereference the pointer to memory in the heap (it's a kind of jump), while for value types it's already on the stack or even optimized out to registers.
I think you see a difference because of this.
P.S. btw, by "most of the time are" I meant boxing; it's a technique used to place value type objects on the heap (e.g. to cast value types to an interface or for dynamic method call binding).
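To make the value type versus reference type distinction and the boxing remark concrete, here is a small sketch. All the names in it (`IHasX`, `PointStruct`, `PointClass`, `SemanticsDemo`) are hypothetical and exist only for this illustration. It shows copy semantics, not performance.

```csharp
using System;

// Hypothetical interface, used only to trigger boxing of the struct.
public interface IHasX
{
    double X { get; set; }
}

public struct PointStruct : IHasX
{
    public double X { get; set; }
}

public class PointClass : IHasX
{
    public double X { get; set; }
}

public static class SemanticsDemo
{
    public static void Main()
    {
        // Value type: assignment copies the data.
        var s1 = new PointStruct { X = 1 };
        var s2 = s1;
        s2.X = 2;
        Console.WriteLine(s1.X); // 1, s1 is unaffected by the change to s2

        // Reference type: assignment copies the reference to the same object.
        var c1 = new PointClass { X = 1 };
        var c2 = c1;
        c2.X = 2;
        Console.WriteLine(c1.X); // 2, c1 and c2 point at one object

        // Boxing: casting the struct to an interface puts a copy on the heap.
        IHasX boxed = s1;
        boxed.X = 5;
        Console.WriteLine(s1.X);    // 1, the original struct is unchanged
        Console.WriteLine(boxed.X); // 5, only the boxed copy changed
    }
}
```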