Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scala Iterable Memory Leaks

I recently started playing with Scala and ran across the following. Below are 4 different ways to iterate through the lines of a file, do some stuff, and write the result to another file. Some of these methods work as I would think (though using a lot of memory to do so) and some eat memory to no end.

The idea was to wrap Scala's getLines Iterator as an Iterable. I don't care if it reads the file multiple times - that's what I expect it to do.

Here's my repro code:

class FileIterable(file: java.io.File) extends Iterable[String] {
  override def iterator = io.Source.fromFile(file).getLines
}

// Iterator

// Option 1: Direct iterator - holds at 100MB
def lines = io.Source.fromFile(file).getLines

// Option 2: Get iterator via method - holds at 100MB
def lines = new FileIterable(file).iterator

// Iterable

// Option 3: TraversableOnce wrapper - holds at 2GB
def lines = io.Source.fromFile(file).getLines.toIterable

// Option 4: Iterable wrapper - leaks like a sieve
def lines = new FileIterable(file)

def values = lines
      .drop(1)
      //.map(l => l.split("\t")).map(l => l.reduceLeft(_ + "|" + _))
      //.filter(l => l.startsWith("*"))

val writer = new java.io.PrintWriter(new File("out.tsv"))
values.foreach(v => writer.println(v))
writer.close()

The file it's reading is ~10GB with 1MB lines.

The first two options iterate the file using a constant amount of memory (~100MB). This is what I would expect. The downside here is that an iterator can only be used once and it's using Scala's call-by-name convention as a psuedo-iterable. (For reference, the equivalent c# code uses ~14MB)

The third method calls toIterable defined in TraverableOnce. This one works, but it uses about 2GB to do the same work. No idea where the memory is going because it can't cache the entire Iterable.

The fourth is the most alarming - it immediately uses all available memory and throws an OOM exception. Even weirder is that it does this for all of the operations I've tested: drop, map, and filter. Looking at the implementations, none of them seem to maintain much state (though the drop looks a little suspect - why does it not just count the items?). If I do no operations, it works fine.

My guess is that somewhere it's maintaining references to each of the lines read, though I can't imagine how. I've seen the same memory usage when passing Iterables around in Scala. For example if I take case 3 (.toIterable) and pass that to a method that writes an Iterable[String] to a file, I see the same explosion.

Any ideas?

like image 921
Matt Bossenbroek Avatar asked Sep 20 '12 01:09

Matt Bossenbroek


1 Answers

Note how the ScalaDoc of Iterable says:

Implementations of this trait need to provide a concrete method with signature:

  def iterator: Iterator[A]

They also need to provide a method newBuilder which creates a builder for collections of the same kind.

Since you don't provide an implementation for newBuilder, you get the default implementation, which uses a ListBuffer and thus tries to fit everything into memory.

You might want to implement Iterable.drop as

def drop(n: Int) = iterator.drop(n).toIterable

but that would break with the representation invariance of the collection library (i.e. iterator.toIterable returns a Stream, while you want List.drop to return a List etc - thus the need for the Builder concept).

like image 166
themel Avatar answered Sep 20 '22 15:09

themel