Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does Seq give stack overflow when iterating through large csv file

I have a csv file with the following structure :

  1. The first line is a header row
  2. The remaining lines are data lines, each with the same number of commas, so we can think of the data in terms of columns

I have written a little script to go through each line of the file and return a sequence of tuples containing the column header and the length of the largest string of data in that column :

let getColumnInfo (fileName:string) =
    let delimiter = ','

    let readLinesIntoColumns (sr:StreamReader) = seq { 
        while not sr.EndOfStream do     
            yield sr.ReadLine().Split(delimiter) |> Seq.map (fun c -> c.Length )
    }

    use sr = new StreamReader(fileName)     
    let headers = sr.ReadLine().Split(delimiter) 
    let columnSizes =
        let initial = Seq.map ( fun h -> 0 ) headers
        let toMaxColLengths (accumulator:seq<int>) (line:seq<int>)  = 
             let chooseBigger a b = if a > b then a else b
             Seq.map2 chooseBigger accumulator line
        readLinesIntoColumns sr |> Seq.fold toMaxColLengths initial
    Seq.zip headers columnSizes;

This works fine on a small file. However when it trys to process a large file (> 75 Mb) it blows fsi with a StackOverflow exception. If I remove the line

Seq.map2 chooseBigger accumulator line

the program completes.

Now, my question is this : why is F# using up the stack? My understanding of sequences in F# is that the entire sequence is not held in memory, only the elements that are being processed. Therefore I expected that the lines that had already been processed would not remain on the stack. Where is my misunderstanding?

like image 408
Aidan Avatar asked Jan 17 '23 06:01

Aidan


2 Answers

I think this is a good question. Here's a simpler repro:

let test n =
    [for i in 1 .. n -> Seq.empty]
    |> List.fold (Seq.map2 max) Seq.empty
    |> Seq.iter ignore

test creats a sequence of empty sequences, calculates the max by rows, and then iterates over the resulting (empty) sequence. You'll find that with a high value of n this will cause a stack overflow, even though there aren't any values to iterate over at all!

It's a bit tricky to explain why, but here's a stab at it. The problem is that as you fold over the sequences, Seq.map2 is returning a new sequence which defers its work until it's enumerated. Thus, when you try to iterate through the resulting sequence, you end up calling back into a chain of computations n layers deep.

As Daniel explains, you can escape this by evaluating the resulting sequence eagerly (e.g. by converting it to a list).

EDIT

Here's an attempt to further explain what's going wrong. When you call Seq.map2 max s1 s2, neither s1 nor s2 is actually enumerated; you get a new sequence which, when enumerated, will enumerate both of them and compare the yielded values. Thus, if we do something like the following:

let s0 = Seq.empty
let s1 = Seq.map2 max Seq.emtpy s0
let s2 = Seq.map2 max Seq.emtpy s1
let s3 = Seq.map2 max Seq.emtpy s2
let s4 = Seq.map2 max Seq.emtpy s3
let s5 = Seq.map2 max Seq.emtpy s4
...

Then the call to Seq.map2 always returns immediately and uses constant stack space. However, enumerating s5 requires enumerating s4, which requires enumerating s3, etc. This means that enumerating s99999 will build up a huge call stack that looks sort of like:

...
(s99996's enumerator).MoveNext()
(s99997's enumerator).MoveNext()
(s99998's enumerator).MoveNext()
(s99999's enumerator).MoveNext()

and we'll get a stack overflow.

like image 198
kvb Avatar answered Feb 16 '23 01:02

kvb


Your code contains so many sequences it's hard to reason about. My guess is that's what's tripping you up. You can make this much simpler and efficient (eagerness is not all bad):

let getColumnInfo (fileName:string) =
  let delimiter = ','
  use sr = new StreamReader(fileName)
  match sr.ReadLine() with
  | null | "" -> Array.empty
  | hdr ->
    let cols = hdr.Split(delimiter)
    let counts = Array.zeroCreate cols.Length
    while not sr.EndOfStream do
      sr.ReadLine().Split(delimiter)
      |> Array.iteri (fun i fld ->
        counts.[i] <- max counts.[i] fld.Length)
    Array.zip cols counts

This assumes all lines are non-empty and have the same number of columns.

You can fix your function by changing this line to:

Seq.map2 chooseBigger accumulator line |> Seq.toList |> seq
like image 39
Daniel Avatar answered Feb 16 '23 01:02

Daniel