Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Performance penalty when Generic.List<T>.Add is the last statement in a function and tailcall optimization is on

Tags:

.net

f#

cil

I've run into a strange performance penalty that I've boiled down to this code:

[<Struct>]
type Vector3(x: float32, y: float32, z: float32) =
  member this.X = x
  member this.Y = y
  member this.Z = z

type Data(n: int) =   
  let positions = System.Collections.Generic.List<Vector3>()
  let add j = positions.Add (Vector3(j, j, j))
  let add1 j = positions.Add (Vector3(j, j, j)); ()
  member this.UseAdd () = for i = 1 to n do add (float32 i)
  member this.UseAdd1 () = for i = 1 to n do add1 (float32 i)

let timeIt name (f: unit -> unit) = 
  let timer = System.Diagnostics.Stopwatch.StartNew()
  f ()
  printfn "%s: %ims" name (int timer.ElapsedMilliseconds)

let test () =
  for i = 1 to 3 do timeIt "ADD" (fun () -> Data(1000000).UseAdd())
  for i = 1 to 3 do timeIt "ADD1" (fun () -> Data(1000000).UseAdd1())

[<EntryPoint>]
let main argv = 
  test ()
  0

The difference between add and add1 is the extra () at the end.

When I build it as x64 Release build using F# 3.1 on .NET 4.5.1 I get this output:

ADD: 461ms
ADD: 457ms
ADD: 450ms
ADD1: 25ms
ADD1: 26ms
ADD1: 16ms

Since the type of List<T>.Add is T -> unit I would expect that add and add1 should behave identically.

Using ILdasm I've found that add compiles to (including only the relevant part)

IL_000a:  newobj     instance void Program/Vector3::.ctor(float32,
                                                          float32,
                                                          float32)
IL_000f:  tail.
IL_0011:  callvirt   instance void class [mscorlib]System.Collections.Generic.List`1<valuetype Program/Vector3>::Add(!0)

while add1 into

IL_000a:  newobj     instance void Program/Vector3::.ctor(float32,
                                                          float32,
                                                          float32)
IL_000f:  callvirt   instance void class [mscorlib]System.Collections.Generic.List`1<valuetype Program/Vector3>::Add(!0)

i.e. without the "tail call". So when I turn off tail call optimization, both add and add1 run at the same speed.

Why does the tail. instruction cause the function call to be that much slower? Also, is this a bug or a feature?


EDIT: This is the original code here I noticed this behavior. When the true value at the end is dropped, it exhibits the same performance drop as the code above.

let makeAtom (ctx: CleanCifContext) (element: CleanCifAtomSiteElement) = 
  let residue = getResidue ctx element

  let position =
    Vector3(float32 (element.PositionX.ValueOrFail()), float32 (element.PositionY.ValueOrFail()), float32 (element.PositionZ.ValueOrFail()))
  let atom = 
    CifAtom(id = ctx.Atoms.Count, element = element.ElementSymbol.ValueOrFail(),
            residue = residue, serialNumber = element.Id.ValueOrFail(), 
            name = element.Name.ValueOrFail(), authName = element.AuthName.Value(), altLoc = element.AltLoc.Value(),
            occupancy = float32 (element.Occupancy.ValueOrFail()), tempFactor = float32 (element.TempFactor.ValueOrFail()))

  ctx.Atoms.Add atom
  ctx.Positions.Add position
  true
like image 887
Dave Avatar asked Feb 21 '15 18:02

Dave


1 Answers

I think I've figured out where the problem is and why it is my misunderstanding of the problem rather than bug in the F# compiler or .NET.

The code

let add j = positions.Add (Vector3(j, j, j))

means roughly "call List<T>.Add from the tailcall position on the value Vector3(j, j, j)" while

let add1 j = positions.Add (Vector3(j, j, j)); ()

means "call List<T>.Add on the value Vector3(j, j, j) and then return unit".

Type-wise, there is no difference as List<T>.Add returns unit so I incorrectly assumed positions.Add would get called and then add would return the value unit which is the return value of List<T>.Add. However, as stated at http://blogs.msdn.com/b/clrcodegeneration/archive/2009/05/11/tail-call-improvements-in-net-framework-4.aspx, the JIT needs to perform some "stack magic" when the arguments of the tail-called function are non-trivial. And this is where the performance gap comes from. The difference is very subtle, but it is there.

like image 127
Dave Avatar answered Nov 17 '22 10:11

Dave