I've run into a strange performance penalty that I've boiled down to this code:
[<Struct>]
type Vector3(x: float32, y: float32, z: float32) =
    member this.X = x
    member this.Y = y
    member this.Z = z

type Data(n: int) =
    let positions = System.Collections.Generic.List<Vector3>()
    let add j = positions.Add (Vector3(j, j, j))
    let add1 j = positions.Add (Vector3(j, j, j)); ()
    member this.UseAdd () = for i = 1 to n do add (float32 i)
    member this.UseAdd1 () = for i = 1 to n do add1 (float32 i)

let timeIt name (f: unit -> unit) =
    let timer = System.Diagnostics.Stopwatch.StartNew()
    f ()
    printfn "%s: %ims" name (int timer.ElapsedMilliseconds)

let test () =
    for i = 1 to 3 do timeIt "ADD" (fun () -> Data(1000000).UseAdd())
    for i = 1 to 3 do timeIt "ADD1" (fun () -> Data(1000000).UseAdd1())

[<EntryPoint>]
let main argv =
    test ()
    0
The only difference between add and add1 is the extra () at the end.
When I build it as an x64 Release build using F# 3.1 on .NET 4.5.1, I get this output:
ADD: 461ms
ADD: 457ms
ADD: 450ms
ADD1: 25ms
ADD1: 26ms
ADD1: 16ms
Since the type of List<T>.Add is T -> unit, I would expect add and add1 to behave identically. Using ILDasm, I've found that add compiles to (showing only the relevant part)
IL_000a: newobj instance void Program/Vector3::.ctor(float32, float32, float32)
IL_000f: tail.
IL_0011: callvirt instance void class [mscorlib]System.Collections.Generic.List`1<valuetype Program/Vector3>::Add(!0)
while add1 compiles to
IL_000a: newobj instance void Program/Vector3::.ctor(float32, float32, float32)
IL_000f: callvirt instance void class [mscorlib]System.Collections.Generic.List`1<valuetype Program/Vector3>::Add(!0)
i.e. without the tail. prefix. Accordingly, when I turn off tail-call optimization (the "Generate tail calls" project setting), both add and add1 run at the same speed.
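For reference, a hedged sketch of the command-line equivalent of that project setting, assuming fsc is on PATH (the file name is illustrative):

```shell
# Rebuild with tail-call emission disabled; --tailcalls- is the
# fsc switch behind unchecking "Generate tail calls".
# -O keeps the other Release optimizations on.
fsc -O --tailcalls- Program.fs
```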
Why does the tail. instruction make the call that much slower? Also, is this a bug or a feature?
EDIT: This is the original code in which I noticed the behavior. When the true value at the end is dropped, it exhibits the same performance drop as the code above.
let makeAtom (ctx: CleanCifContext) (element: CleanCifAtomSiteElement) =
    let residue = getResidue ctx element
    let position =
        Vector3(float32 (element.PositionX.ValueOrFail()),
                float32 (element.PositionY.ValueOrFail()),
                float32 (element.PositionZ.ValueOrFail()))
    let atom =
        CifAtom(id = ctx.Atoms.Count, element = element.ElementSymbol.ValueOrFail(),
                residue = residue, serialNumber = element.Id.ValueOrFail(),
                name = element.Name.ValueOrFail(), authName = element.AuthName.Value(),
                altLoc = element.AltLoc.Value(),
                occupancy = float32 (element.Occupancy.ValueOrFail()),
                tempFactor = float32 (element.TempFactor.ValueOrFail()))
    ctx.Atoms.Add atom
    ctx.Positions.Add position
    true
I think I've figured out where the problem is, and why it stems from my misunderstanding rather than a bug in the F# compiler or .NET.
The code

let add j = positions.Add (Vector3(j, j, j))

roughly means "call List<T>.Add on the value Vector3(j, j, j) from the tail-call position", while

let add1 j = positions.Add (Vector3(j, j, j)); ()

means "call List<T>.Add on the value Vector3(j, j, j) and then return unit".
Type-wise, there is no difference, since List<T>.Add returns unit, so I incorrectly assumed positions.Add would be called and then add would simply return unit, the return value of List<T>.Add. However, as explained at http://blogs.msdn.com/b/clrcodegeneration/archive/2009/05/11/tail-call-improvements-in-net-framework-4.aspx, the JIT needs to perform some "stack magic" when the arguments of a tail-called function are non-trivial, and that is where the performance gap comes from. The difference is subtle, but it is there.
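For anyone hitting the same issue, a minimal self-contained sketch of the workaround this implies: take the call out of tail position by sequencing () or by piping into ignore. (Behavior as observed with F# 3.1 / .NET 4.5.1 in the question; the names here are illustrative, and later compilers may emit different IL.)

```fsharp
// Three ways to write the same call; the first leaves it in tail position
// (eligible for the `tail.` prefix), the other two do not.
let positions = System.Collections.Generic.List<int>()

// Call in tail position: the compiler may emit `tail. callvirt`.
let addTail j = positions.Add j

// Sequencing `()` after the call takes it out of tail position.
let addSeq j = positions.Add j; ()

// Piping the (unit) result into `ignore` has the same effect.
let addIgnore j = positions.Add j |> ignore

addTail 0
addSeq 1
addIgnore 2
printfn "count = %d" positions.Count  // prints "count = 3"
```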