
FSharp.Data CsvProvider performance

Tags: mono, f#, f#-data

I have a csv file with 6 columns and 678,552 rows. Unfortunately I cannot share any data sample but the types are straightforward: int64, int64, date, date, string, string and there are no missing values.

Time to load this data in a dataframe in R using read.table: ~ 3 seconds.

Time to load this data using CsvFile.Load in F#: ~ 3 seconds.

Time to load this data in a Deedle dataframe in F#: ~ 7 seconds.

Adding inferTypes=false and providing a schema to Deedle's Frame.ReadCsv (sketched below) reduces the time to ~ 3 seconds.

Time to load this data using CsvProvider in F#: ~ 5 minutes.

And this 5 minutes is even after I define the types in the Schema parameter, presumably eliminating the time F# would use to infer them.

I understand that the type provider needs to do a lot more than R or CsvFile.Load in order to parse the data into the correct types, but I am surprised by the ~100x speed penalty. Even more confusing is the time Deedle takes to load the data, since it also needs to infer types, cast appropriately, organize the data into Series, etc. I would actually have expected Deedle to take longer than CsvProvider.
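
For reference, the Deedle call I am timing with inference turned off looks roughly like this (a sketch only; I have simplified the path, and the schema string just mirrors the column types listed above):

open Deedle

// Sketch of the Deedle load with type inference off and an explicit schema.
// The schema string mirrors the column types described above.
let frame =
    Frame.ReadCsv(
        "all_diagnoses.csv",
        separators = "|",
        inferTypes = false,
        schema = "int64,int64,date,date,string,string")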

In this issue, the bad performance of CsvProvider was caused by a large number of columns, which is not my case.

I am wondering if I am doing something wrong or if there is any way to speed things up a bit.

Just to clarify: creating the provider is almost instantaneous. It is when I force the generated sequence to be realized by Seq.length df.Rows that it takes ~ 5 minutes for the fsharpi prompt to return.

I'm on a Linux system, F# 4.1 on Mono 4.6.1.

Here is the code for the CsvProvider:

let [<Literal>] SEP = "|"
let [<Literal>] CULTURE = "sv-SE"
let [<Literal>] DATAFILE = dataroot + "all_diagnoses.csv"

type DiagnosesProvider = CsvProvider<DATAFILE, Separators=SEP, Culture=CULTURE>
let diagnoses = DiagnosesProvider()
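
For reference, this is roughly how I force the rows to be realized (the step that takes ~ 5 minutes), and the variant with the types spelled out in Schema; the column names in the schema string are made up, since I cannot share the real ones:

// Forcing the lazily generated sequence; this is the call that takes ~ 5 minutes.
let nRows = diagnoses.Rows |> Seq.length

// Variant with an explicit Schema (illustrative column names). For me this
// makes no perceptible difference to the load time.
type DiagnosesWithSchema =
    CsvProvider<DATAFILE, Separators=SEP, Culture=CULTURE,
                Schema="Id (int64), ForeignId (int64), FromDate (date), ToDate (date), Code (string), Text (string)">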

EDIT1: I added the time Deedle takes to load the data into a frame.

EDIT2: Added the time Deedle takes if inferTypes=false and a schema is provided.

Also, supplying CacheRows=false to the CsvProvider, as suggested in the comments, has no perceptible effect on the load time.

EDIT3: Ok, we are getting somewhere. For some peculiar reason it seems that Culture is the culprit. If I omit this argument, CsvProvider loads the data in ~ 7 seconds. I am unsure what could be causing this. My system's locale is en_US. The data, however, come from an SQL Server with a Swedish locale, where the decimal separator is ',' instead of '.'. This particular dataset does not have any decimals, so I can omit Culture altogether. Another set, however, has 2 decimal columns and more than 1,000,000 rows. My next task is to test this on a Windows system, which I don't have available at the moment.
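
To illustrate why the other dataset needs the culture at all (a sketch, not my real data):

open System
open System.Globalization

// Swedish-formatted decimals use ',' as the decimal separator.
let sv  = CultureInfo.GetCultureInfo "sv-SE"
let ok  = Double.Parse("3,14", sv)   // 3.14
// With the invariant culture ',' is a group separator, so the same text
// parses to a different value (or fails, depending on the number style).
let odd = Double.Parse("3,14", CultureInfo.InvariantCulture)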

EDIT4: The problem seems solved, but I still don't understand what causes it. If I change the culture "globally" by doing:

open System.Globalization

CultureInfo.DefaultThreadCurrentCulture <- CultureInfo("sv-SE")
System.Threading.Thread.CurrentThread.CurrentCulture <- CultureInfo("sv-SE")

and then remove the Culture="sv-SE" argument from the CsvProvider, the load time is reduced to ~ 6 seconds and the decimals are parsed correctly. I'm leaving this open in case anyone can give an explanation for this behavior.

Asked Feb 26 '17 by kliron


1 Answer

I tried to reproduce the problem you are seeing; since you can't share the data, I generated some test data of my own. However, on my machine (.NET 4.6.2, F# 4.1) it doesn't take minutes, it takes seconds.

Perhaps you can try to see how my sample application performs in your setup and we can work from that?

open System
open System.Diagnostics
open System.IO

let clock =
  let sw = Stopwatch ()
  sw.Start ()
  fun () ->
    sw.ElapsedMilliseconds

let time a =
  let before  = clock ()
  let v       = a ()
  let after   = clock ()
  after - before, v

let generateDataSet () =
  let random            = Random 19740531

  let firstDate         = DateTime(1970, 1, 1)

  // 11-digit int64s, sortable ("s") date strings and short quoted strings,
  // roughly matching the column types in the question.
  let randomInt     ()  = random.Next () |> int64 |> (+) 10000000000L |> string
  let randomDate    ()  = (firstDate + (random.Next () |> float |> TimeSpan.FromSeconds)).ToString("s")
  let randomString  ()  =
    // Quoted string of 0-15 random printable ASCII chars, with '"' and '\'
    // replaced by spaces so the CSV quoting stays valid.
    let inline valid ch =
      match ch with
      | '"'
      | '\\'  -> ' '
      | _     -> ch
    let c   = random.Next () % 16
    let g i =
      if i = 0 || i = c + 1 then '"'
      else 32 + random.Next() % (127 - 32) |> char |> valid
    Array.init (c + 2) g |> String

  let columns =
    [|
      "Id"          , randomInt
      "ForeignId"   , randomInt
      "BirthDate"   , randomDate
      "OtherDate"   , randomDate
      "FirstName"   , randomString
      "LastName"    , randomString
    |]

  use sw      = new StreamWriter ("perf.csv")
  let headers = columns |> Array.map fst |> String.concat ";"
  sw.WriteLine headers
  for i = 0 to 700000 do
    let values = columns |> Array.map (fun (_, f) -> f ()) |> String.concat ";"
    sw.WriteLine values

open FSharp.Data

[<Literal>]
let sample = """Id;ForeignId;BirthDate;OtherDate;FirstName;LastName
11795679844;10287417237;2028-09-14T20:33:17;1993-07-21T17:03:25;",xS@ %aY)N*})Z";"ZP~;"
11127366946;11466785219;2028-02-22T08:39:57;2026-01-24T05:07:53;"H-/QA(";"g8}J?k~"
"""

type PerfFile = CsvProvider<sample, ";">

let readDataWithTp () =
  use streamReader  = new StreamReader ("perf.csv")
  let csvFile       = PerfFile.Load streamReader
  let length        = csvFile.Rows |> Seq.length
  printfn "%A" length

[<EntryPoint>]
let main argv = 
  Environment.CurrentDirectory <- AppDomain.CurrentDomain.BaseDirectory

  printfn "Generating dataset..."
  let ms, _ = time generateDataSet
  printfn "  took %d ms" ms

  printfn "Reading dataset..."
  let ms, _ = time readDataWithTp
  printfn "  took %d ms" ms

  0

The performance numbers (.NET 4.6.2 on my desktop):

Generating dataset...
  took 2162 ms
Reading dataset...
  took 6156 ms

The performance numbers (Mono 4.6.2 on my MacBook Pro):

Generating dataset...
  took 4432 ms
Reading dataset...
  took 8304 ms

Update

It turns out that specifying Culture explicitly to CsvProvider seems to degrade performance a lot. It can be any culture, not just sv-SE, but why?

If one checks the code the provider generates for the fast and the slow cases, one notices a difference:

Fast

internal sealed class csvFile@78
{
  internal System.Tuple<long, long, System.DateTime, System.DateTime, string, string> Invoke(object arg1, string[] arg2)
  {
    Microsoft.FSharp.Core.FSharpOption<string> fSharpOption = TextConversions.AsString(arg2[0]);
    long arg_C9_0 = TextRuntime.GetNonOptionalValue<long>("Id", TextRuntime.ConvertInteger64("", fSharpOption), fSharpOption);
    fSharpOption = TextConversions.AsString(arg2[1]);
    long arg_C9_1 = TextRuntime.GetNonOptionalValue<long>("ForeignId", TextRuntime.ConvertInteger64("", fSharpOption), fSharpOption);
    fSharpOption = TextConversions.AsString(arg2[2]);
    System.DateTime arg_C9_2 = TextRuntime.GetNonOptionalValue<System.DateTime>("BirthDate", TextRuntime.ConvertDateTime("", fSharpOption), fSharpOption);
    fSharpOption = TextConversions.AsString(arg2[3]);
    System.DateTime arg_C9_3 = TextRuntime.GetNonOptionalValue<System.DateTime>("OtherDate", TextRuntime.ConvertDateTime("", fSharpOption), fSharpOption);
    fSharpOption = TextConversions.AsString(arg2[4]);
    string arg_C9_4 = TextRuntime.GetNonOptionalValue<string>("FirstName", TextRuntime.ConvertString(fSharpOption), fSharpOption);
    fSharpOption = TextConversions.AsString(arg2[5]);
    return new System.Tuple<long, long, System.DateTime, System.DateTime, string, string>(arg_C9_0, arg_C9_1, arg_C9_2, arg_C9_3, arg_C9_4, TextRuntime.GetNonOptionalValue<string>("LastName", TextRuntime.ConvertString(fSharpOption), fSharpOption));
  }
}

Slow

internal sealed class csvFile@78
{
  internal System.Tuple<long, long, System.DateTime, System.DateTime, string, string> Invoke(object arg1, string[] arg2)
  {
    Microsoft.FSharp.Core.FSharpOption<string> fSharpOption = TextConversions.AsString(arg2[0]);
    long arg_C9_0 = TextRuntime.GetNonOptionalValue<long>("Id", TextRuntime.ConvertInteger64("sv-SE", fSharpOption), fSharpOption);
    fSharpOption = TextConversions.AsString(arg2[1]);
    long arg_C9_1 = TextRuntime.GetNonOptionalValue<long>("ForeignId", TextRuntime.ConvertInteger64("sv-SE", fSharpOption), fSharpOption);
    fSharpOption = TextConversions.AsString(arg2[2]);
    System.DateTime arg_C9_2 = TextRuntime.GetNonOptionalValue<System.DateTime>("BirthDate", TextRuntime.ConvertDateTime("sv-SE", fSharpOption), fSharpOption);
    fSharpOption = TextConversions.AsString(arg2[3]);
    System.DateTime arg_C9_3 = TextRuntime.GetNonOptionalValue<System.DateTime>("OtherDate", TextRuntime.ConvertDateTime("sv-SE", fSharpOption), fSharpOption);
    fSharpOption = TextConversions.AsString(arg2[4]);
    string arg_C9_4 = TextRuntime.GetNonOptionalValue<string>("FirstName", TextRuntime.ConvertString(fSharpOption), fSharpOption);
    fSharpOption = TextConversions.AsString(arg2[5]);
    return new System.Tuple<long, long, System.DateTime, System.DateTime, string, string>(arg_C9_0, arg_C9_1, arg_C9_2, arg_C9_3, arg_C9_4, TextRuntime.GetNonOptionalValue<string>("LastName", TextRuntime.ConvertString(fSharpOption), fSharpOption));
  }
}

More specifically, this is the difference:

// Fast
TextRuntime.ConvertDateTime("", fSharpOption)
// Slow
TextRuntime.ConvertDateTime("sv-SE", fSharpOption)

When we specify a culture, it is passed to ConvertDateTime, which forwards it to GetCulture:

static member GetCulture(cultureStr) =
  if String.IsNullOrWhiteSpace cultureStr 
  then CultureInfo.InvariantCulture 
  else CultureInfo cultureStr

This means that in the default case we use CultureInfo.InvariantCulture, but in every other case a new CultureInfo object is created for each field of each row. Caching could be done, but it isn't. The creation itself doesn't seem to take too much time, but something happens when we parse with a fresh CultureInfo object each time.
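
A cached variant could look roughly like this; it is only a sketch of the idea, not FSharp.Data's actual code (CultureInfo.GetCultureInfo returns a cached, read-only instance per culture name):

open System
open System.Globalization

// Sketch only: resolve the culture through the framework's own cache instead of
// allocating a new CultureInfo for every field of every row.
let getCultureCached (cultureStr : string) =
  if String.IsNullOrWhiteSpace cultureStr
  then CultureInfo.InvariantCulture
  else CultureInfo.GetCultureInfo cultureStr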

Parsing a DateTime in FSharp.Data essentially boils down to this:

let dateTimeStyles = DateTimeStyles.AllowWhiteSpaces ||| DateTimeStyles.RoundtripKind
match DateTime.TryParse(text, cultureInfo, dateTimeStyles) with

So let's make a performance test where we use a cached CultureInfo object and another one where we create one each time.

open System
open System.Diagnostics
open System.Globalization

let clock =
  let sw = Stopwatch ()
  sw.Start ()
  fun () ->
    sw.ElapsedMilliseconds

let time a =
  let before  = clock ()
  let v       = a ()
  let after   = clock ()
  after - before, v

let perfTest c cf () =
  let dateTimeStyles = DateTimeStyles.AllowWhiteSpaces ||| DateTimeStyles.RoundtripKind
  let text = DateTime.Now.ToString ("", cf ())
  for i = 1 to c do
    let culture = cf ()
    DateTime.TryParse(text, culture, dateTimeStyles) |> ignore

[<EntryPoint>]
let main argv = 
  Environment.CurrentDirectory <- AppDomain.CurrentDomain.BaseDirectory

  let ct    = "sv-SE"
  let cct   = CultureInfo ct
  let count = 10000

  printfn "Using cached CultureInfo object..."
  let ms, _ = time (perfTest count (fun () -> cct))
  printfn "  took %d ms" ms

  printfn "Using fresh CultureInfo object..."
  let ms, _ = time (perfTest count (fun () -> CultureInfo ct))
  printfn "  took %d ms" ms

  0

Performance numbers on .NET 4.6.2 F#4.1:

Using cached CultureInfo object...
  took 16 ms
Using fresh CultureInfo object...
  took 5328 ms

So it seems caching the CultureInfo object in FSharp.Data should improve CsvProvider performance significantly when a culture is specified.
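
To sanity-check the idea, a third case could be added to main in the benchmark above, resolving the culture through CultureInfo.GetCultureInfo, which hands back a cached instance; the existing time, perfTest, count and ct are reused as-is (a sketch, no numbers claimed here):

  // Hypothetical extra case for the benchmark above: let the framework cache the instance.
  printfn "Using CultureInfo.GetCultureInfo..."
  let ms, _ = time (perfTest count (fun () -> CultureInfo.GetCultureInfo ct))
  printfn "  took %d ms" ms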

Answered Oct 18 '22 by Just another metaprogrammer