I really like the
for (line <- Source fromFile inputPath getLines) { doSomething(line) }
construction for iterating over a file in Scala, and I am wondering whether there is a way to use a similar construction to iterate over the lines of every file in a directory.
An important restriction here is that the files add up to far more data than would fit on the heap (think dozens of GB, so increasing the heap size isn't an option). As a workaround for the time being, I have been cat'ing everything together into one file and using the above construction, which works because of laziness.
Point being, this seems to raise the question: can I concatenate two (hundred) lazy iterators and get one really big, really lazy iterator?
Yes, although it's not quite so concise:
import java.io.File
import scala.io.Source
for {
file <- new File(dir).listFiles.toIterator if file.isFile
line <- Source fromFile file getLines
} { doSomething(line) }
The trick is flatMap and its for-comprehension syntactic sugar. The above, for example, is more or less equivalent to the following:
new File(dir)
.listFiles.toIterator
.filter(_.isFile)
.flatMap(Source fromFile _ getLines)
.map(doSomething)
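Incidentally, this also answers the concatenation question directly: Iterator's `++` is lazy, so chaining any number of per-file iterators never forces their contents. A minimal sketch, using in-memory iterators as stand-ins for per-file line iterators:

```scala
// Two small lazy iterators standing in for per-file line iterators.
val first = Iterator("a1", "a2")
val second = Iterator("b1", "b2")

// ++ is lazy: nothing is consumed from either side until the
// combined iterator is actually traversed.
val combined = first ++ second

// The same idea scales to any number of iterators, e.g. one per file.
val perFile = List(Iterator("x"), Iterator("y"), Iterator("z"))
val all = perFile.foldLeft(Iterator.empty: Iterator[String])(_ ++ _)

println(combined.mkString(","))  // a1,a2,b1,b2
println(all.mkString(","))       // x,y,z
```

`flatMap` over an iterator of files is effectively this fold done for you.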
As Daniel Sobral notes in a comment below, this approach (and the code in your question) will leave files open. If this is a one-off script or you're just working in the REPL, this might not be a big deal. If you do run into problems, you can use the pimp-my-library pattern to implement some basic resource management:
implicit def toClosingSource(source: Source) = new {
  val lines = source.getLines
  var stillOpen = true

  // Like getLines, but closes the underlying Source as soon as
  // the last line has been produced.
  def getLinesAndClose = new Iterator[String] {
    def hasNext = stillOpen && lines.hasNext
    def next = {
      val line = lines.next
      if (!lines.hasNext) { source.close(); stillOpen = false }
      line
    }
  }
}
Now just use Source fromFile file getLinesAndClose
and you won't have to worry about files being left open.
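If the implicit feels too heavy, a plain try/finally per file gives the same guarantee without any library pimping. This is just a sketch of that alternative (`eachLine` is a made-up helper name, not part of the answer above):

```scala
import java.io.File
import scala.io.Source

// Hypothetical helper: apply doSomething to every line of every
// regular file in dir, closing each Source as soon as that file
// is finished (or if doSomething throws).
def eachLine(dir: String)(doSomething: String => Unit): Unit =
  for (file <- new File(dir).listFiles if file.isFile) {
    val source = Source.fromFile(file)
    try source.getLines.foreach(doSomething)
    finally source.close()
  }
```

Only one file is open at a time, and lines are still streamed lazily within each file, so the dozens-of-GB case stays heap-friendly.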