
Iterating over the lines of a file

Tags: iterator, io, scala

I'd like to write a simple function that iterates over the lines of a text file. I believe in 2.8 one could do:

def lines(filename: String) : Iterator[String] = { 
    scala.io.Source.fromFile(filename).getLines
}

and that was that, but in 2.9 the above doesn't work and instead I must do:

import java.io.File

def lines(filename: String) : Iterator[String] = { 
    scala.io.Source.fromFile(new File(filename)).getLines()
}

Now, the trouble is, I want to compose the above iterators in a for comprehension:

for ( l1 <- lines("file1.txt"); l2 <- lines("file2.txt") ){ 
    do_stuff(l1, l2) 
}

This, again, used to work fine with 2.8 but throws a "too many open files" exception in 2.9. This is understandable: the second call to lines in the comprehension opens (and never closes) a file for each line in the first.
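To make the cause concrete, here is a minimal sketch of how a for comprehension without yield is rewritten into nested foreach calls. The open function and the in-memory lists are hypothetical stand-ins for lines and the two files; the sketch only counts how often a source is created.

```scala
object DesugarDemo {
  // Number of times the hypothetical `open` has been called.
  var opened = 0

  // Stand-in for lines(filename): every call "opens" a new source.
  def open(data: List[String]): Iterator[String] = {
    opened += 1
    data.iterator
  }

  def run(): Int = {
    opened = 0
    val big   = List("a", "b", "c")
    val small = List("x", "y")
    // for (l1 <- open(big); l2 <- open(small)) { ... } is rewritten to:
    open(big).foreach { l1 =>
      // Evaluated once per element of the outer iterator.
      open(small).foreach { l2 =>
        () // do_stuff(l1, l2) would go here
      }
    }
    opened // 1 open for `big` plus 3 for `small`
  }
}
```

With a three-line outer source the inner source is opened three times, so with a large first file the number of open handles grows with its line count unless each inner source is closed.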

In my case, I know that "file1.txt" is big and I don't want to suck it into memory, but the second file is small, so I can write a different linesEager like so:

def linesEager(filename: String): Iterator[String] = {
    val buf = scala.io.Source.fromFile(new File(filename))
    // Force the whole file into a List so the source can be closed right away.
    val zs  = buf.getLines().toList.iterator
    buf.close()
    zs
}

and then turn my for-comprehension into:

for (l1 <- lines("file1.txt"); l2 <- linesEager("file2.txt")){ 
    do_stuff(l1, l2) 
}

This works, but is clearly ugly. Can someone suggest a uniform and clean way of achieving the above? It seems you need a way for the iterator returned by lines to close the file when it reaches the end. Was this happening in 2.8, and is that why it worked there?
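The idea of an iterator that closes its file on reaching the end could be sketched roughly as follows. This is an illustrative assumption rather than what 2.8 actually did; the ClosingLines object and its wrapper are hypothetical names, and only Source.fromFile, getLines and close come from the standard library.

```scala
import scala.io.Source

object ClosingLines {
  // Hypothetical helper: wraps getLines() so that the Source is closed
  // as soon as the iterator reports that it is exhausted.
  def lines(filename: String): Iterator[String] = {
    val src        = Source.fromFile(filename)
    val underlying = src.getLines()
    new Iterator[String] {
      private var closed = false

      def hasNext: Boolean = {
        // Never touch the underlying iterator once the source is closed.
        val more = !closed && underlying.hasNext
        if (!more && !closed) {
          src.close()
          closed = true
        }
        more
      }

      def next(): String = {
        if (!hasNext) throw new NoSuchElementException("end of file")
        underlying.next()
      }
    }
  }
}
```

A caveat of this design: the file is only closed if the iterator is consumed to the end. A caller that stops early (for example with take(1)) still leaks the handle, which is one reason explicit resource management is usually preferred.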

Thanks!

BTW -- here is a minimal version of the full program that shows the issue:

import java.io.PrintWriter
import java.io.File

object Fail { 

  def lines(filename: String) : Iterator[String] = { 
    val f = new File(filename)
    scala.io.Source.fromFile(f).getLines()
  }

  def main(args: Array[String]) = { 
    val smallFile = args(0)
    val bigFile   = args(1)

    println("helloworld")

    for ( w1 <- lines(bigFile)
        ; w2 <- lines(smallFile)
        ) 
    {
      if (w2 == w1){
        val msg = "%s=%s".format(w1, w2)
        println("found " + msg)
      }
    }

    println("goodbye")
  }

}

On 2.9.0 I compile with scalac WordsFail.scala and then I get this:

rjhala@goto:$ scalac WordsFail.scala 
rjhala@goto:$ scala Fail passwd words
helloworld
java.io.FileNotFoundException: passwd (Too many open files)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:120)
    at scala.io.Source$.fromFile(Source.scala:91)
    at scala.io.Source$.fromFile(Source.scala:76)
    at Fail$.lines(WordsFail.scala:8)
    at Fail$$anonfun$main$1.apply(WordsFail.scala:18)
    at Fail$$anonfun$main$1.apply(WordsFail.scala:17)
    at scala.collection.Iterator$class.foreach(Iterator.scala:652)
    at scala.io.BufferedSource$BufferedLineIterator.foreach(BufferedSource.scala:30)
    at Fail$.main(WordsFail.scala:17)
    at Fail.main(WordsFail.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at scala.tools.nsc.util.ScalaClassLoader$$anonfun$run$1.apply(ScalaClassLoader.scala:78)
    at scala.tools.nsc.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:24)
    at scala.tools.nsc.util.ScalaClassLoader$URLClassLoader.asContext(ScalaClassLoader.scala:88)
    at scala.tools.nsc.util.ScalaClassLoader$class.run(ScalaClassLoader.scala:78)
    at scala.tools.nsc.util.ScalaClassLoader$URLClassLoader.run(ScalaClassLoader.scala:101)
    at scala.tools.nsc.ObjectRunner$.run(ObjectRunner.scala:33)
    at scala.tools.nsc.ObjectRunner$.runAndCatch(ObjectRunner.scala:40)
    at scala.tools.nsc.MainGenericRunner.runTarget$1(MainGenericRunner.scala:56)
    at scala.tools.nsc.MainGenericRunner.process(MainGenericRunner.scala:80)
    at scala.tools.nsc.MainGenericRunner$.main(MainGenericRunner.scala:89)
    at scala.tools.nsc.MainGenericRunner.main(MainGenericRunner.scala)

Ranjit Jhala asked Apr 26 '12 17:04


2 Answers

scala-arm provides a great mechanism for automagically closing resources when you're done with them.

import resource._
import scala.io.Source

for (file1 <- managed(Source.fromFile("file1.txt"));
     l1 <- file1.getLines();
     file2 <- managed(Source.fromFile("file2.txt"));
     l2 <- file2.getLines()) {
  do_stuff(l1, l2)
}

But unless you're counting on the contents of file2.txt to change while you're looping through file1.txt, it would be best to read that into a List before you loop. There's no need to convert it into an Iterator.
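Without scala-arm, the same advice (hold the small file in a List, stream the big one) might look like the sketch below. The names EagerSmallFile, readAll and forEachPair are hypothetical; the only library calls are Source.fromFile, getLines and close.

```scala
import scala.io.Source

object EagerSmallFile {
  // Reads every line into memory and closes the file even if reading fails.
  def readAll(filename: String): List[String] = {
    val src = Source.fromFile(filename)
    try src.getLines().toList finally src.close()
  }

  // Calls `f` for every pair of lines: `bigFile` is streamed once,
  // `smallFile` is read once and kept in memory.
  def forEachPair(bigFile: String, smallFile: String)(f: (String, String) => Unit): Unit = {
    val small = readAll(smallFile)
    val big   = Source.fromFile(bigFile)
    try {
      for (l1 <- big.getLines(); l2 <- small) f(l1, l2)
    } finally big.close()
  }
}
```

Each file is opened exactly once here, so the number of open handles no longer depends on how many lines the big file has.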

leedm777 answered Sep 20 '22 23:09


Maybe you should take a look at scala-arm (https://github.com/jsuereth/scala-arm) and let the closing of the files (file input streams) happen automatically in the background.

Heiko Seeberger answered Sep 20 '22 23:09