There are many discussions on this topic already, but I am all about flogging dead horses, particularly when I discover they may still be breathing.
I was working on parsing the unusual and exotic file format that is the CSV, and for fun I decided to compare its performance in the two .NET languages I know, C# and F#.
The results were... unsettling. F# won, by a wide margin - a factor of 2 or more (and I actually think it's more like 0.5n, but getting real benchmarks is proving to be tough since I am testing against hardware I/O).
Divergent performance characteristics in something as common as reading a CSV are surprising to me (note that the coefficient means that C# wins on very small files; the more testing I do, the more it feels like C# scales worse, which is both surprising and concerning, since it probably means I am doing it wrong).
Some notes: Core 2 Duo laptop, 80 GB spindle disk, 3 GB DDR-800 memory, Windows 7 64-bit Premium, .NET 4, no power-saving options enabled.
30,000 lines, 5 fields wide, one phrase of 10 characters or less per field, gives a factor of 3 in favor of the tail-call recursion after the first run (the OS appears to cache the file).
300,000 lines (the same data repeated) gives a factor of 2 for the tail-call recursion, with F#'s mutable implementation winning out slightly, but the performance signatures suggest that I am hitting the disk rather than reading the whole file from RAM, which causes semi-random performance spikes.
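If anyone wants to reproduce the data shape, here is a rough C# sketch of how a file like that could be generated - the TestData/MakeTestFile names and placeholder field contents are just illustrative, not the exact data I used:

using System;
using System.IO;

static class TestData
{
    // Writes `lines` rows of 5 comma-separated placeholder fields, each 10 characters or less.
    public static void MakeTestFile(string path, int lines)
    {
        var rand = new Random();
        using (var writer = new StreamWriter(path))
        {
            for (int i = 0; i < lines; i++)
            {
                var fields = new string[5];
                for (int j = 0; j < 5; j++)
                    fields[j] = "fld" + rand.Next(100000); // at most 9 characters
                writer.WriteLine(string.Join(",", fields));
            }
        }
    }
}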
F# code
//Module used to import data from an arbitrary CSV source
module CSVImport

open System.IO

//imports the data from a path into a list of string arrays
let ImportData (path:string) : List<string []> =
    //recursively rips through the file, grabbing a line and adding it to the accumulator list
    let rec readline (reader:StreamReader) (lines:List<string []>) : List<string []> =
        let line = reader.ReadLine()
        match line with
        | null -> lines
        | _ -> readline reader (line.Split(',')::lines)
    //grab a file and open it, then return the parsed data
    use chaosfile = new StreamReader(path)
    readline chaosfile []

//a recreation of the above function using a while loop
let ImportDataWhile (path:string) : list<string []> =
    use chaosfile = new StreamReader(path)
    //values mutated in a loop construct must be mutable
    let mutable retval = []
    //loop
    while chaosfile.EndOfStream <> true do
        retval <- chaosfile.ReadLine().Split(',')::retval
    //return retval by just declaring it
    retval

let CSVlines (path:string) : string seq =
    seq { use streamreader = new StreamReader(path)
          while not streamreader.EndOfStream do
              yield streamreader.ReadLine() }

let ImportDataSeq (path:string) : string [] list =
    let mutable retval = []
    let sequencer = CSVlines path
    for line in sequencer do
        retval <- line.Split()::retval
    retval
C# Code
using System;
using System.Collections.Generic;
using System.Linq;
using System.IO;
using System.Text;

namespace CSVparse
{
    public class CSVprocess
    {
        public static List<string[]> ImportDataC(string path)
        {
            List<string[]> retval = new List<string[]>();
            using (StreamReader readfile = new StreamReader(path))
            {
                string line = readfile.ReadLine();
                while (line != null)
                {
                    retval.Add(line.Split());
                    line = readfile.ReadLine();
                }
            }
            return retval;
        }

        public static List<string[]> ImportDataReadLines(string path)
        {
            List<string[]> retval = new List<string[]>();
            IEnumerable<string> toparse = File.ReadLines(path);
            foreach (string split in toparse)
            {
                retval.Add(split.Split());
            }
            return retval;
        }
    }
}
Note the variety of implementations there: iterators, sequences, tail-call optimizations, while loops in two languages...
A major issue is that I am hitting the disk, so some idiosyncrasies can be accounted for by that. I intend to rewrite this code to read from a memory stream (which should be more consistent, assuming I don't start to swap).
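A rough sketch of what I have in mind for the C# side - load the file into a byte array outside the timed region, then parse from a MemoryStream so the timed section never touches the disk (ImportDataFromMemory is just an illustrative name):

public static List<string[]> ImportDataFromMemory(byte[] rawBytes)
{
    // parse from an in-memory stream so the timed section never touches the disk
    List<string[]> retval = new List<string[]>();
    using (var memory = new MemoryStream(rawBytes))
    using (var readfile = new StreamReader(memory))
    {
        string line;
        while ((line = readfile.ReadLine()) != null)
        {
            retval.Add(line.Split(','));
        }
    }
    return retval;
}

// usage: read the bytes once, outside the timed region
// byte[] raw = File.ReadAllBytes("test.csv");
// var rows = CSVprocess.ImportDataFromMemory(raw);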
But everything I have been taught and read says that while loops and for loops are faster than tail-call recursion, and every actual benchmark I run says the exact opposite.
So I guess my question is, should I question the conventional wisdom?
Is tail call recursion really better than while loops in the .net ecosystem?
How does this work out on Mono?
I think that the difference may arise from the different List types in F# and C#. F# uses singly linked lists (see http://msdn.microsoft.com/en-us/library/dd233224.aspx), whereas in C# System.Collections.Generic.List is used, which is based on arrays.
Prepending an element is much faster for singly linked lists, especially when you're parsing big files (with an array-based list, you need to allocate and copy the whole backing array from time to time).
Try using a LinkedList in the C# code; I'm curious about the results :) ...
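Something like this sketch is what I mean (ImportDataLinked is just a name I made up for illustration):

public static LinkedList<string[]> ImportDataLinked(string path)
{
    // LinkedList<T>.AddFirst is O(1), like consing (::) onto an F# list,
    // so there is no periodic reallocation/copying of a backing array.
    var retval = new LinkedList<string[]>();
    using (StreamReader readfile = new StreamReader(path))
    {
        string line;
        while ((line = readfile.ReadLine()) != null)
        {
            retval.AddFirst(line.Split(','));
        }
    }
    return retval;
}

Note that the rows come out in reverse order, just like the F# versions that cons onto the front of the list.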
PS: Also, this would be a good example of when to use a profiler. You could easily find the "hot spot" of the C# code that way...
EDIT
So, I tried this for myself: I used two identical files in order to prevent caching effects. The files were 3,000,000 lines long, each line containing 'abcdef' ten times, separated by commas.
The main program looks like this:
static void Main(string[] args) {
    var dt = DateTime.Now;
    CSVprocess.ImportDataC("test.csv"); // C# implementation
    System.Console.WriteLine("Time {0}", DateTime.Now - dt);

    dt = DateTime.Now;
    CSVImport.ImportData("test1.csv"); // F# implementation
    System.Console.WriteLine("Time {0}", DateTime.Now - dt);
}
(I also tried it with first executing the F# implementation and then the C#...)
The result is:
Running the C# solution after the F# solution gives the same performance for the F# version but 4.7 seconds for C# (I assume due to heavy memory allocation by the F# solution). Running each solution alone doesn't change the above results.
Using a file with 6,000,000 lines gives ~7 seconds for the C# solution; the F# solution produces an OutOfMemoryException (I'm running this on a machine with 12 GB of RAM...).
So for me it seems that the conventional 'wisdom' is true, and C# using a simple loop is faster for this kind of task...
You really, really, really, really shouldn't be reading anything into these results - either benchmark your entire system as a form of system test, or remove the disk I/O from the benchmark. It's just going to confuse matters. It's probably better practice to take a TextReader parameter rather than a physical path, to avoid chaining the implementation to physical files.
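A rough sketch of what that could look like - the core logic takes a TextReader, the path overload becomes a thin wrapper, and the benchmark can then run over a StringReader so the disk never enters the timed region (this is my suggestion, not your original code):

public static List<string[]> ImportDataC(TextReader reader)
{
    // core parsing logic works against any TextReader, not just a file on disk
    var retval = new List<string[]>();
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        retval.Add(line.Split(','));
    }
    return retval;
}

public static List<string[]> ImportDataC(string path)
{
    // the path-based version is now just a thin wrapper
    using (var readfile = new StreamReader(path))
    {
        return ImportDataC(readfile);
    }
}

// benchmark without disk I/O: load the text first, then time only the parsing
// string text = File.ReadAllText("test.csv");
// List<string[]> rows = CSVprocess.ImportDataC(new StringReader(text));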
Additionally, as a microbenchmark your test has a few other flaws:
- ImportDataC or ImportDataReadLines? Pick and choose for clarity - and in real applications, don't duplicate implementations, but factor out similarities and define one in terms of the other.
- You .Split(',') in F# but .Split() in C# - do you intend to split on commas or on whitespace?

I note that it looks like your F# is using an F# list whereas the C# is using a .NET List. You might try changing the F# version to use the other list type and gather more data.