Modifying a large file in Scala

Question

I am trying to modify a large PostScript file in Scala (some are as large as 1GB in size). The file is a group of batches, with each batch containing a code that represents the batch number, number of pages, etc.

I need to:

Search the file for the batch codes (which always start with the same line in the file)
Count the number of pages until the next batch code
Modify the batch code to include how many pages are in each batch.
Save the new file in a different location.

My current solution uses two iterators (iterA and iterB), created from Source.fromFile("file.ps").getLines. The first iterator (iterA) traverses in a while loop to the beginning of a batch code (with iterB.next being called each time as well). iterB then continues searching until the next batch code (or the end of the file), counting the number of pages it passes as it goes. Then, it updates the batch code at iterA's position, an the process repeats.

This seems very non-Scala-like and I still haven't designed a good way to save these changes into a new file.

What is a good approach to this problem? Should I ditch iterators entirely? I'd preferably like to do it without having to have the entire input or output into memory at once.

Thanks!

stephenjudkins · Accepted Answer

You could probably implement this with Scala's Stream class. I am assuming that you don't mind holding one "batch" in memory at a time.

import scala.annotation.tailrec
import scala.io._

def isBatchLine(line:String):Boolean = ...

def batchLine(size: Int):String = ...

val it = Source.fromFile("in.ps").getLines
// cannot use it.toStream here because of SI-4835
def inLines = Stream.continually(i).takeWhile(_.hasNext).map(_.next)

// Note: using `def` instead of `val` here means we don't hold
// the entire stream in memory
def batchedLinesFrom(stream: Stream[String]):Stream[String] = {
  val (batch, remainder) = stream span { !isBatchLine(_) }
  if (batch.isEmpty && remainder.isEmpty) { 
    Stream.empty
  } else {
    batchLine(batch.size) #:: batch #::: batchedLinesFrom(remainder.drop(1))
  }
}

def newLines = batchedLinesFrom(inLines dropWhile isBatchLine)

val ps = new java.io.PrintStream(new java.io.File("out.ps"))

newLines foreach ps.println

ps.close()

MxFr · Answer

If you not in pursuit of functional scala enlightenment, I'd recommend a more imperative style using java.util.Scanner#findWithinHorizon. My example is quite naive, iterating through the input twice.

val scanner = new Scanner(inFile)

val writer = new BufferedWriter(...)

def loop() = {
  // you might want to limit the horizon to prevent OutOfMemoryError
  Option(scanner.findWithinHorizon(".*YOUR-BATCH-MARKER", 0)) match {
    case Some(batch) =>
      val pageCount = countPages(batch)
      writePageCount(writer, pageCount)
      writer.write(batch)        
      loop()

    case None =>
  }
}

loop()
scanner.close()
writer.close()

Modifying a large file in Scala

Tags:

iterator

file-io

scala

scala-2.9

postscript

Andrew Conner

2 Answers

stephenjudkins

MxFr

Recent Activity

Donate For Us

Modifying a large file in Scala

Tags:

iterator

file-io

scala

scala-2.9

postscript

Andrew Conner

2 Answers

stephenjudkins

MxFr

Related questions

Recent Activity

Donate For Us