
Modifying a large file in Scala

I am trying to modify large PostScript files in Scala (some are as large as 1 GB). Each file is a sequence of batches, and each batch contains a code that represents the batch number, number of pages, etc.

I need to:

  1. Search the file for the batch codes (which always start with the same line in the file)
  2. Count the number of pages until the next batch code
  3. Modify the batch code to include how many pages are in each batch.
  4. Save the new file in a different location.

My current solution uses two iterators (iterA and iterB), both created from Source.fromFile("file.ps").getLines. In a while loop, iterA advances to the start of a batch code, with iterB.next called on every step so that the two stay in sync. iterB then continues to the next batch code (or the end of the file), counting the pages it passes along the way. The batch code at iterA's position is then updated, and the process repeats.

This seems very non-Scala-like and I still haven't designed a good way to save these changes into a new file.

What is a good approach to this problem? Should I ditch iterators entirely? Ideally, I'd like to avoid holding the entire input or output in memory at once.

Thanks!

Andrew Conner asked Feb 16 '12 16:02


2 Answers

You could probably implement this with Scala's Stream class. I am assuming that you don't mind holding one "batch" in memory at a time.

import scala.annotation.tailrec
import scala.io._

def isBatchLine(line:String):Boolean = ...

def batchLine(size: Int):String = ...

val it = Source.fromFile("in.ps").getLines
// cannot use it.toStream here because of SI-4835
def inLines = Stream.continually(it).takeWhile(_.hasNext).map(_.next)

// Note: `inLines` above and `newLines` below are `def`s rather than `val`s,
// so no reference to the head of the stream is kept and we don't hold
// the entire stream in memory
def batchedLinesFrom(stream: Stream[String]):Stream[String] = {
  val (batch, remainder) = stream span { !isBatchLine(_) }
  if (batch.isEmpty && remainder.isEmpty) { 
    Stream.empty
  } else {
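    // batch.size is the number of lines in the batch; if a page spans several
    // lines, count only the page-marker lines here instead
    batchLine(batch.size) #:: batch #::: batchedLinesFrom(remainder.drop(1))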
  }
}

def newLines = batchedLinesFrom(inLines dropWhile isBatchLine)

val ps = new java.io.PrintStream(new java.io.File("out.ps"))

newLines foreach ps.println

ps.close()
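
The same "one batch in memory at a time" idea can also be sketched with a plain buffered iterator, which avoids Stream altogether. This is only an illustration under assumptions: the `%%Batch:` and `%%Page:` markers, the `pages=` suffix, and the names `BatchRewriter`, `withPageCount` and `rewrite` are hypothetical and not taken from the question.

```scala
import java.io.Writer
import scala.collection.mutable.ArrayBuffer

object BatchRewriter {
  // Hypothetical markers, assumed for illustration only.
  val BatchMarker = "%%Batch:"
  val PageMarker  = "%%Page:"

  def isBatchLine(line: String): Boolean = line.startsWith(BatchMarker)
  def isPageLine(line: String): Boolean  = line.startsWith(PageMarker)

  // Hypothetical output format: append the page count to the batch code line.
  def withPageCount(batchLine: String, pages: Int): String =
    s"$batchLine pages=$pages"

  def rewrite(lines: Iterator[String], out: Writer): Unit = {
    val in = lines.buffered // lets us peek at the next line via `head`
    def emit(line: String): Unit = { out.write(line); out.write("\n") }

    // Copy anything that precedes the first batch code unchanged.
    while (in.hasNext && !isBatchLine(in.head)) emit(in.next())

    while (in.hasNext) {
      val header = in.next() // the batch code line
      // Buffer only the current batch, up to (not including) the next batch code.
      val body = ArrayBuffer.empty[String]
      while (in.hasNext && !isBatchLine(in.head)) body += in.next()
      emit(withPageCount(header, body.count(isPageLine)))
      body.foreach(emit)
    }
  }
}
```

To run it on files, pass `Source.fromFile("in.ps").getLines()` as `lines` and a `BufferedWriter` over the output file as `out`, closing both afterwards. Memory use is bounded by the largest single batch rather than by the file size.
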
stephenjudkins answered Sep 27 '22 17:09


If you are not in pursuit of functional Scala enlightenment, I'd recommend a more imperative style using java.util.Scanner#findWithinHorizon. My example is quite naive, iterating through the input twice.

import java.io.BufferedWriter
import java.util.Scanner

val scanner = new Scanner(inFile)

val writer = new BufferedWriter(...)

// a recursive method needs an explicit result type
def loop(): Unit = {
  // you might want to limit the horizon to prevent OutOfMemoryError
  // (?s) lets `.` match line terminators, so one match can span a whole batch
  Option(scanner.findWithinHorizon("(?s).*?YOUR-BATCH-MARKER", 0)) match {
    case Some(batch) =>
      // countPages and writePageCount are helpers you need to supply
      val pageCount = countPages(batch)
      writePageCount(writer, pageCount)
      writer.write(batch)
      loop()

    case None =>
  }
}

loop()
scanner.close()
writer.close()
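
The "iterate through the input twice" idea can also be sketched with plain line iterators: a first pass records only one page count per batch, and a second pass copies the file while patching each batch code. This is a rough sketch, not a definitive implementation. The `%%Batch:` and `%%Page:` markers, the `pages=` suffix, and the names `TwoPassRewriter`, `countPages`, `rewrite` and `process` are all hypothetical, and character encoding and line endings are not treated specially.

```scala
import java.io.{BufferedWriter, File, FileWriter}
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

object TwoPassRewriter {
  // Hypothetical markers, assumed for illustration only.
  val BatchMarker = "%%Batch:"
  val PageMarker  = "%%Page:"

  // Pass 1: collect one page count per batch, in file order.
  def countPages(lines: Iterator[String]): Vector[Int] = {
    val counts = ArrayBuffer.empty[Int]
    for (line <- lines) {
      if (line.startsWith(BatchMarker)) counts += 0
      else if (line.startsWith(PageMarker) && counts.nonEmpty)
        counts(counts.length - 1) += 1 // page belongs to the latest batch
    }
    counts.toVector
  }

  // Pass 2: copy every line, appending the count to each batch code line.
  def rewrite(lines: Iterator[String], counts: Vector[Int], emit: String => Unit): Unit = {
    val remaining = counts.iterator
    for (line <- lines)
      emit(if (line.startsWith(BatchMarker)) s"$line pages=${remaining.next()}" else line)
  }

  // Reads the input file twice and writes the patched copy to `out`.
  def process(in: File, out: File): Unit = {
    val first = Source.fromFile(in)
    val counts = try countPages(first.getLines()) finally first.close()

    val second = Source.fromFile(in)
    val writer = new BufferedWriter(new FileWriter(out))
    try rewrite(second.getLines(), counts, l => { writer.write(l); writer.newLine() })
    finally { writer.close(); second.close() }
  }
}
```

Only the vector of counts stays in memory between the passes, so memory use grows with the number of batches rather than with the size of the file or of any single batch. The cost is reading the input twice.
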
MxFr answered Sep 27 '22 18:09