
Creating Sequence of Sequences is Causing a StackOverflowException

I'm trying to take a large file and split it into many smaller files. The location where each split occurs is based on a predicate returned from examining the contents of each given line (isNextObject function).

I have attempted to read in the large file via the File.ReadLines function so that I can iterate through the file one line at a time without having to hold the entire file in memory. My approach was to group the sequence into a sequence of smaller sub-sequences (one per file to be written out).

I found a useful function that Tomas Petricek created on fssnip called groupWhen. This function worked great for my initial testing on a small subset of the file, but a StackOverflowException is thrown when using the real file. I am not sure how to adjust the groupWhen function to prevent this (I'm still an F# greenie).

Here is a simplified version of the code, showing only the relevant parts, that will recreate the StackOverflowException:

// This is the function created by Tomas Petricek where the StackOverflowException occurs
module Seq =
  /// Iterates over elements of the input sequence and groups adjacent elements.
  /// A new group is started when the specified predicate holds about the element
  /// of the sequence (and at the beginning of the iteration).
  ///
  /// For example: 
  ///    Seq.groupWhen isOdd [3;3;2;4;1;2] = seq [[3]; [3; 2; 4]; [1; 2]]
  let groupWhen f (input:seq<_>) = seq {
    use en = input.GetEnumerator()
    let running = ref true

    // Generate a group starting with the current element. Stops generating
    // when it finds an element such that 'f en.Current' is 'true'
    let rec group() = 
      [ yield en.Current
        if en.MoveNext() then
          if not (f en.Current) then yield! group() // *** Exception occurs here ***
        else running := false ]

    if en.MoveNext() then
      // While there are still elements, start a new group
      while running.Value do
        yield group() |> Seq.ofList } 

This is the gist of the code making use of Tomas' function:

module Extractor =

    open System
    open System.IO
    open Microsoft.FSharp.Reflection

    // ... elided a few functions, including "isNextObject", which is
    //     a string -> bool (examines the line and returns true
    //     if the line meets the criteria indicating that we are at
    //     the start of the next inner file)

    let writeFile outputDir file =
        // ... write out "file" to the file system
        // NOTE: file is a seq<string>

    let writeFiles outputDir (files : seq<seq<_>>) =
        files
        |> Seq.iter (fun file -> writeFile outputDir file)

And here is the relevant code in the console application that makes use of the functions:

let lines = inputFile |> File.ReadLines

writeFiles outputDir (lines |> Seq.groupWhen isNextObject)

Any ideas on the proper way to stop groupWhen from blowing the stack? I'm not sure how I would convert the function to use an accumulator (or to use a continuation instead, which I think is the correct terminology).

asked Apr 16 '26 by Jason Down

2 Answers

The problem with this is that the group() function returns a list, which is an eagerly evaluated data structure. Every time you call group(), it has to run to the end, collect all results in a list, and return that list, which means the recursive call happens within that same evaluation - that is, truly recursively - creating stack pressure.

To mitigate this problem, you could just replace the list with a lazy sequence:

let rec group() = seq {
   yield en.Current
   if en.MoveNext() then
     if not (f en.Current) then yield! group()
   else running := false }
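Put together as a self-contained sketch (renamed groupWhenLazy here only to distinguish it from the original), the fix looks like this. The tail-position yield! in a sequence expression is evaluated on demand rather than through a nested call, and each group is materialized before the next starts because all groups share one enumerator:

```fsharp
module Seq =
  /// Lazy variant of groupWhen: group() builds a sequence, so the
  /// recursive yield! no longer forces the eager, deeply nested
  /// evaluation that the list comprehension did.
  let groupWhenLazy f (input: seq<_>) = seq {
    use en = input.GetEnumerator()
    let running = ref true
    let rec group () = seq {
      yield en.Current
      if en.MoveNext() then
        if not (f en.Current) then yield! group ()
      else running := false }
    if en.MoveNext() then
      while running.Value do
        // Materialize each group before moving on, since every
        // group reads from the same underlying enumerator.
        yield group () |> List.ofSeq }

let isOdd n = n % 2 = 1

// Same behavior as the documented example:
let groups = [3; 3; 2; 4; 1; 2] |> Seq.groupWhenLazy isOdd |> List.ofSeq
// → [[3]; [3; 2; 4]; [1; 2]]
```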

However, I would consider less drastic approaches. This example is a good illustration of why you should avoid doing recursion yourself and resort to ready-made folds instead.

For example, judging by your description, it seems that Seq.windowed may work for you.
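For context (not part of the answer itself): Seq.windowed yields fixed-size sliding windows as arrays, which is a different shape from the variable-length groups above, so it fits only if the splits have a known size:

```fsharp
// Seq.windowed n yields every contiguous run of n elements
// as an array, advancing one element at a time.
let windows = [1; 2; 3; 4] |> Seq.windowed 2 |> List.ofSeq
// → [[|1; 2|]; [|2; 3|]; [|3; 4|]]
```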

answered Apr 19 '26 by Fyodor Soikin


It's easy to overuse sequences in F#, IMO. You can accidentally get stack overflows, plus they are slow.

So (not actually answering your question), personally I would just fold over the seq of lines using something like this:

let isNextObject line = 
    line = "---"

type State = {
    fileIndex : int
    filename: string
    writer: System.IO.TextWriter
    }

let makeFilename index  = 
    sprintf "File%i" index

let closeFile (state:State) =
    //state.writer.Close() // would use this in real code
    state.writer.WriteLine("=== Closing {0} ===",state.filename)

let createFile index =
    let newFilename = makeFilename index 
    let newWriter = System.Console.Out // dummy
    newWriter.WriteLine("=== Creating {0} ===",newFilename)
    // create new state with new writer 
    {fileIndex=index + 1; writer = newWriter; filename=newFilename }

let writeLine (state:State) line = 
    if isNextObject line then
        // finish old file here
        closeFile state
        // create new file here and return updated state
        createFile state.fileIndex
    else
        //write the line to the current file
        state.writer.WriteLine(line)
        // return the unchanged state
        state

let processLines (lines: string seq) =
    //setup
    let initialState = createFile 1
    // process the file
    let finalState = lines |> Seq.fold writeLine initialState
    // tidy up
    closeFile finalState

(Obviously a real version would use files rather than the console)

Yes, it is crude, but it is easy to reason about, with no unpleasant surprises.

Here's a test:

processLines [
    "a"; "b"
    "---";"c"; "d"
    "---";"e"; "f"
]

And here's what the output looks like:

=== Creating File1 ===
a
b
=== Closing File1 ===
=== Creating File2 ===
c
d
=== Closing File2 ===
=== Creating File3 ===
e
f
=== Closing File3 ===
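As noted above, a real version would write to files rather than the console. A minimal sketch of file-backed versions of createFile/closeFile (the Path.Combine layout and the ".txt" naming are illustrative assumptions, not from the original answer) might look like:

```fsharp
open System.IO

// Same State shape as in the answer; StreamWriter is a TextWriter.
type State = {
    fileIndex : int
    filename : string
    writer : TextWriter
    }

let closeRealFile (state: State) =
    // Dispose flushes buffered text and releases the file handle.
    state.writer.Dispose()

let createRealFile outputDir index =
    let path = Path.Combine(outputDir, sprintf "File%i.txt" index)
    { fileIndex = index + 1
      filename = path
      writer = (new StreamWriter(path) :> TextWriter) }
```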
answered Apr 19 '26 by Grundoon


