I have a csv file with 6 columns and 678,552 rows. Unfortunately I cannot share any data sample, but the types are straightforward: int64, int64, date, date, string, string, and there are no missing values.
Time to load this data into a dataframe in R using read.table: ~3 seconds.
Time to load this data using CsvFile.Load in F#: ~3 seconds.
Time to load this data into a Deedle dataframe in F#: ~7 seconds.
Adding inferTypes=false and providing a schema to Deedle's Frame.ReadCsv reduces the time to ~3 seconds.
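For reference, a minimal sketch of what that Deedle call looks like in my setup (the file name and separator match my data; the parameter names follow Deedle's Frame.ReadCsv API, so treat the exact schema string as an assumption):

```fsharp
open Deedle

// Sketch: skip type inference and declare the column types up front.
let df : Frame<int, string> =
    Frame.ReadCsv(
        "all_diagnoses.csv",
        separators = "|",
        inferTypes = false,
        schema = "int64,int64,DateTime,DateTime,string,string")
```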
Time to load this data using CsvProvider in F#: ~5 minutes.
And these 5 minutes are even after I define the types in the Schema parameter, presumably eliminating the time F# would spend inferring them.
I understand that the type provider needs to do a lot more than R or CsvFile.Load in order to parse the data into the correct types, but I am surprised by the 100x speed penalty. Even more confusing is the time Deedle takes to load the data, since it also needs to infer types, cast appropriately, organize the data into Series, etc. I would actually have expected Deedle to take longer than CsvProvider.
In this issue the bad performance of CsvProvider was caused by a large number of columns, which is not my case.
I am wondering if I am doing something wrong or if there is any way to speed things up a bit.
Just to clarify: creating the provider is almost instantaneous. It is when I force the generated sequence to be realized with Seq.length df.Rows that it takes ~5 minutes for the fsharpi prompt to return.
I'm on a Linux system, F# 4.1 on Mono 4.6.1.
Here is the code for the CsvProvider
let [<Literal>] SEP = "|"
let [<Literal>] CULTURE = "sv-SE"
let [<Literal>] DATAFILE = dataroot + "all_diagnoses.csv"
type DiagnosesProvider = CsvProvider<DATAFILE, Separators=SEP, Culture=CULTURE>
let diagnoses = DiagnosesProvider()
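To be explicit about what "load" means here, the slow step is forcing the lazily generated row sequence, e.g. in fsharpi:

```fsharp
#time "on"
// This is the call that takes ~5 minutes to return:
diagnoses.Rows |> Seq.length
```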
EDIT1: I added the time Deedle takes to load the data into a frame.
EDIT2: Added the time Deedle takes if inferTypes=false and a schema is provided. Also, supplying CacheRows=false in the CsvProvider, as suggested in the comments, has no perceptible effect on the time to load.
EDIT3:
Ok, we are getting somewhere. For some peculiar reason it seems that Culture is the culprit. If I omit this argument, CsvProvider loads the data in ~7 seconds. I am unsure what could be causing this. My system's locale is en_US. The data, however, come from an SQL Server with a Swedish locale, where decimal digits are separated by ',' instead of '.'. This particular dataset does not have any decimals, so I can omit Culture altogether. Another set, however, has 2 decimal columns and more than 1,000,000 rows. My next task is to test this on a Windows system, which I don't have available at the moment.
EDIT4: The problem seems solved, but I still don't understand what causes it. If I change the culture "globally" by doing (note that the assignment operator is <-; plain = would merely compare):
System.Globalization.CultureInfo.DefaultThreadCurrentCulture <- CultureInfo("sv-SE")
System.Threading.Thread.CurrentThread.CurrentCulture <- CultureInfo("sv-SE")
and then remove the Culture="sv-SE" argument from the CsvProvider, the load time is reduced to ~6 seconds and the decimals are parsed correctly. I'm leaving this open in case anyone can give an explanation for this behavior.
I am trying to reproduce the problem you are seeing. Since you can't share the data, I tried generating some test data. However, on my machine (.NET 4.6.2, F# 4.1) I don't see it taking minutes; it takes seconds.
Perhaps you can try my sample application in your setup and we can work from that?
open System
open System.Diagnostics
open System.IO

let clock =
  let sw = Stopwatch ()
  sw.Start ()
  fun () -> sw.ElapsedMilliseconds

let time a =
  let before = clock ()
  let v = a ()
  let after = clock ()
  after - before, v

let generateDataSet () =
  let random = Random 19740531
  let firstDate = DateTime(1970, 1, 1)
  let randomInt () = random.Next () |> int64 |> (+) 10000000000L |> string
  let randomDate () = (firstDate + (random.Next () |> float |> TimeSpan.FromSeconds)).ToString("s")
  let randomString () =
    let inline valid ch =
      match ch with
      | '"'
      | '\\' -> ' '
      | _ -> ch
    let c = random.Next () % 16
    let g i =
      if i = 0 || i = c + 1 then '"'
      else 32 + random.Next() % (127 - 32) |> char |> valid
    Array.init (c + 2) g |> String
  let columns =
    [|
      "Id"        , randomInt
      "ForeignId" , randomInt
      "BirthDate" , randomDate
      "OtherDate" , randomDate
      "FirstName" , randomString
      "LastName"  , randomString
    |]
  use sw = new StreamWriter ("perf.csv")
  let headers = columns |> Array.map fst |> String.concat ";"
  sw.WriteLine headers
  for i = 0 to 700000 do
    let values = columns |> Array.map (fun (_, f) -> f ()) |> String.concat ";"
    sw.WriteLine values

open FSharp.Data

[<Literal>]
let sample = """Id;ForeignId;BirthDate;OtherDate;FirstName;LastName
11795679844;10287417237;2028-09-14T20:33:17;1993-07-21T17:03:25;",xS@ %aY)N*})Z";"ZP~;"
11127366946;11466785219;2028-02-22T08:39:57;2026-01-24T05:07:53;"H-/QA(";"g8}J?k~"
"""

type PerfFile = CsvProvider<sample, ";">

let readDataWithTp () =
  use streamReader = new StreamReader ("perf.csv")
  let csvFile = PerfFile.Load streamReader
  let length = csvFile.Rows |> Seq.length
  printfn "%A" length

[<EntryPoint>]
let main argv =
  Environment.CurrentDirectory <- AppDomain.CurrentDomain.BaseDirectory
  printfn "Generating dataset..."
  let ms, _ = time generateDataSet
  printfn "  took %d ms" ms
  printfn "Reading dataset..."
  let ms, _ = time readDataWithTp
  printfn "  took %d ms" ms
  0
The performance numbers (.NET 4.6.2 on my desktop):
Generating dataset...
took 2162 ms
Reading dataset...
took 6156 ms
The performance numbers (Mono 4.6.2 on my MacBook Pro):
Generating dataset...
took 4432 ms
Reading dataset...
took 8304 ms
Update
It turns out that specifying Culture to CsvProvider explicitly seems to degrade performance a lot. It can be any culture, not just sv-SE, but why?
If one checks the code the provider generates for the fast and the slow cases, one notices a difference:
Fast
internal sealed class csvFile@78
{
    internal System.Tuple<long, long, System.DateTime, System.DateTime, string, string> Invoke(object arg1, string[] arg2)
    {
        Microsoft.FSharp.Core.FSharpOption<string> fSharpOption = TextConversions.AsString(arg2[0]);
        long arg_C9_0 = TextRuntime.GetNonOptionalValue<long>("Id", TextRuntime.ConvertInteger64("", fSharpOption), fSharpOption);
        fSharpOption = TextConversions.AsString(arg2[1]);
        long arg_C9_1 = TextRuntime.GetNonOptionalValue<long>("ForeignId", TextRuntime.ConvertInteger64("", fSharpOption), fSharpOption);
        fSharpOption = TextConversions.AsString(arg2[2]);
        System.DateTime arg_C9_2 = TextRuntime.GetNonOptionalValue<System.DateTime>("BirthDate", TextRuntime.ConvertDateTime("", fSharpOption), fSharpOption);
        fSharpOption = TextConversions.AsString(arg2[3]);
        System.DateTime arg_C9_3 = TextRuntime.GetNonOptionalValue<System.DateTime>("OtherDate", TextRuntime.ConvertDateTime("", fSharpOption), fSharpOption);
        fSharpOption = TextConversions.AsString(arg2[4]);
        string arg_C9_4 = TextRuntime.GetNonOptionalValue<string>("FirstName", TextRuntime.ConvertString(fSharpOption), fSharpOption);
        fSharpOption = TextConversions.AsString(arg2[5]);
        return new System.Tuple<long, long, System.DateTime, System.DateTime, string, string>(arg_C9_0, arg_C9_1, arg_C9_2, arg_C9_3, arg_C9_4, TextRuntime.GetNonOptionalValue<string>("LastName", TextRuntime.ConvertString(fSharpOption), fSharpOption));
    }
}
Slow
internal sealed class csvFile@78
{
    internal System.Tuple<long, long, System.DateTime, System.DateTime, string, string> Invoke(object arg1, string[] arg2)
    {
        Microsoft.FSharp.Core.FSharpOption<string> fSharpOption = TextConversions.AsString(arg2[0]);
        long arg_C9_0 = TextRuntime.GetNonOptionalValue<long>("Id", TextRuntime.ConvertInteger64("sv-SE", fSharpOption), fSharpOption);
        fSharpOption = TextConversions.AsString(arg2[1]);
        long arg_C9_1 = TextRuntime.GetNonOptionalValue<long>("ForeignId", TextRuntime.ConvertInteger64("sv-SE", fSharpOption), fSharpOption);
        fSharpOption = TextConversions.AsString(arg2[2]);
        System.DateTime arg_C9_2 = TextRuntime.GetNonOptionalValue<System.DateTime>("BirthDate", TextRuntime.ConvertDateTime("sv-SE", fSharpOption), fSharpOption);
        fSharpOption = TextConversions.AsString(arg2[3]);
        System.DateTime arg_C9_3 = TextRuntime.GetNonOptionalValue<System.DateTime>("OtherDate", TextRuntime.ConvertDateTime("sv-SE", fSharpOption), fSharpOption);
        fSharpOption = TextConversions.AsString(arg2[4]);
        string arg_C9_4 = TextRuntime.GetNonOptionalValue<string>("FirstName", TextRuntime.ConvertString(fSharpOption), fSharpOption);
        fSharpOption = TextConversions.AsString(arg2[5]);
        return new System.Tuple<long, long, System.DateTime, System.DateTime, string, string>(arg_C9_0, arg_C9_1, arg_C9_2, arg_C9_3, arg_C9_4, TextRuntime.GetNonOptionalValue<string>("LastName", TextRuntime.ConvertString(fSharpOption), fSharpOption));
    }
}
More specifically, this is the difference:
// Fast
TextRuntime.ConvertDateTime("", fSharpOption)
// Slow
TextRuntime.ConvertDateTime("sv-SE", fSharpOption)
When we specify a culture, it is passed to ConvertDateTime, which forwards it to GetCulture:
static member GetCulture(cultureStr) =
    if String.IsNullOrWhiteSpace cultureStr
    then CultureInfo.InvariantCulture
    else CultureInfo cultureStr
This means that in the default case we use CultureInfo.InvariantCulture, but in any other case we create a new CultureInfo object for each field of each row. Caching could be done, but it isn't. The creation itself doesn't seem to take too much time, but something happens when we parse with a fresh CultureInfo object each time.
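A simple fix would be to memoize the lookup. As a sketch (this getCulture is a hypothetical replacement, not the actual FSharp.Data code):

```fsharp
open System
open System.Collections.Concurrent
open System.Globalization

// Hypothetical cached variant of GetCulture: at most one CultureInfo per name.
let private cultureCache = ConcurrentDictionary<string, CultureInfo> ()

let getCulture (cultureStr : string) =
    if String.IsNullOrWhiteSpace cultureStr then CultureInfo.InvariantCulture
    else cultureCache.GetOrAdd (cultureStr, fun s -> CultureInfo s)
```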
Parsing DateTime in FSharp.Data essentially comes down to this:
let dateTimeStyles = DateTimeStyles.AllowWhiteSpaces ||| DateTimeStyles.RoundtripKind
match DateTime.TryParse(text, cultureInfo, dateTimeStyles) with ...
So let's make a performance test comparing a cached CultureInfo object against creating a new one each time.
open System
open System.Diagnostics
open System.Globalization

let clock =
  let sw = Stopwatch ()
  sw.Start ()
  fun () -> sw.ElapsedMilliseconds

let time a =
  let before = clock ()
  let v = a ()
  let after = clock ()
  after - before, v

let perfTest c cf () =
  let dateTimeStyles = DateTimeStyles.AllowWhiteSpaces ||| DateTimeStyles.RoundtripKind
  let text = DateTime.Now.ToString ("", cf ())
  for i = 1 to c do
    let culture = cf ()
    DateTime.TryParse(text, culture, dateTimeStyles) |> ignore

[<EntryPoint>]
let main argv =
  Environment.CurrentDirectory <- AppDomain.CurrentDomain.BaseDirectory
  let ct = "sv-SE"
  let cct = CultureInfo ct
  let count = 10000
  printfn "Using cached CultureInfo object..."
  let ms, _ = time (perfTest count (fun () -> cct))
  printfn "  took %d ms" ms
  printfn "Using fresh CultureInfo object..."
  let ms, _ = time (perfTest count (fun () -> CultureInfo ct))
  printfn "  took %d ms" ms
  0
Performance numbers on .NET 4.6.2, F# 4.1:
Using cached CultureInfo object...
took 16 ms
Using fresh CultureInfo object...
took 5328 ms
So it seems caching the CultureInfo object in FSharp.Data should improve CsvProvider performance significantly when a culture is specified.
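As a possible fix, note that the BCL already provides a cached lookup: CultureInfo.GetCultureInfo returns a cached, read-only instance, whereas the CultureInfo constructor allocates a fresh object on every call.

```fsharp
open System.Globalization

// The constructor allocates a new instance each time...
let fresh1 = CultureInfo "sv-SE"
let fresh2 = CultureInfo "sv-SE"
// ...while GetCultureInfo returns a cached, read-only one.
let cached1 = CultureInfo.GetCultureInfo "sv-SE"
let cached2 = CultureInfo.GetCultureInfo "sv-SE"

obj.ReferenceEquals (fresh1, fresh2)   // false
obj.ReferenceEquals (cached1, cached2) // true
```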