I have a csv file with the following structure :
I have written a little script to go through each line of the file and return a sequence of tuples containing the column header and the length of the largest string of data in that column :
let getColumnInfo (fileName:string) =
let delimiter = ','
let readLinesIntoColumns (sr:StreamReader) = seq {
while not sr.EndOfStream do
yield sr.ReadLine().Split(delimiter) |> Seq.map (fun c -> c.Length )
}
use sr = new StreamReader(fileName)
let headers = sr.ReadLine().Split(delimiter)
let columnSizes =
let initial = Seq.map ( fun h -> 0 ) headers
let toMaxColLengths (accumulator:seq<int>) (line:seq<int>) =
let chooseBigger a b = if a > b then a else b
Seq.map2 chooseBigger accumulator line
readLinesIntoColumns sr |> Seq.fold toMaxColLengths initial
Seq.zip headers columnSizes;
This works fine on a small file. However when it trys to process a large file (> 75 Mb) it blows fsi with a StackOverflow exception. If I remove the line
Seq.map2 chooseBigger accumulator line
the program completes.
Now, my question is this : why is F# using up the stack? My understanding of sequences in F# is that the entire sequence is not held in memory, only the elements that are being processed. Therefore I expected that the lines that had already been processed would not remain on the stack. Where is my misunderstanding?
I think this is a good question. Here's a simpler repro:
let test n =
[for i in 1 .. n -> Seq.empty]
|> List.fold (Seq.map2 max) Seq.empty
|> Seq.iter ignore
test
creats a sequence of empty sequences, calculates the max by rows, and then iterates over the resulting (empty) sequence. You'll find that with a high value of n
this will cause a stack overflow, even though there aren't any values to iterate over at all!
It's a bit tricky to explain why, but here's a stab at it. The problem is that as you fold over the sequences, Seq.map2
is returning a new sequence which defers its work until it's enumerated. Thus, when you try to iterate through the resulting sequence, you end up calling back into a chain of computations n
layers deep.
As Daniel explains, you can escape this by evaluating the resulting sequence eagerly (e.g. by converting it to a list).
EDIT
Here's an attempt to further explain what's going wrong. When you call Seq.map2 max s1 s2
, neither s1
nor s2
is actually enumerated; you get a new sequence which, when enumerated, will enumerate both of them and compare the yielded values. Thus, if we do something like the following:
let s0 = Seq.empty
let s1 = Seq.map2 max Seq.emtpy s0
let s2 = Seq.map2 max Seq.emtpy s1
let s3 = Seq.map2 max Seq.emtpy s2
let s4 = Seq.map2 max Seq.emtpy s3
let s5 = Seq.map2 max Seq.emtpy s4
...
Then the call to Seq.map2
always returns immediately and uses constant stack space. However, enumerating s5 requires enumerating s4, which requires enumerating s3, etc. This means that enumerating s99999 will build up a huge call stack that looks sort of like:
...
(s99996's enumerator).MoveNext()
(s99997's enumerator).MoveNext()
(s99998's enumerator).MoveNext()
(s99999's enumerator).MoveNext()
and we'll get a stack overflow.
Your code contains so many sequences it's hard to reason about. My guess is that's what's tripping you up. You can make this much simpler and efficient (eagerness is not all bad):
let getColumnInfo (fileName:string) =
let delimiter = ','
use sr = new StreamReader(fileName)
match sr.ReadLine() with
| null | "" -> Array.empty
| hdr ->
let cols = hdr.Split(delimiter)
let counts = Array.zeroCreate cols.Length
while not sr.EndOfStream do
sr.ReadLine().Split(delimiter)
|> Array.iteri (fun i fld ->
counts.[i] <- max counts.[i] fld.Length)
Array.zip cols counts
This assumes all lines are non-empty and have the same number of columns.
You can fix your function by changing this line to:
Seq.map2 chooseBigger accumulator line |> Seq.toList |> seq
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With