Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing a file with BodyParser in Scala Play20 with new lines

Excuse the n00bness of this question, but I have a web application where I want to send a potentially large file to the server and have it parse the format. I'm using the Play20 framework and I'm new to Scala.

For example, if I have a csv, I'd like to split each row by "," and ultimately create a List[List[String]] with each field.

Currently, I'm thinking the best way to do this is with a BodyParser (but I could be wrong). My code looks something like:

Iteratee.fold[String, List[List[String]]]() {
  (result, chunk) =>
    result = chunk.splitByNewLine.splitByDelimiter // Psuedocode
}

My first question is, how do I deal with a situation like the one below where a chunk has been split in the middle of a line:

Chunk 1:
1,2,3,4\n
5,6

Chunk 2:
7,8\n
9,10,11,12\n

My second question is, is writing my own BodyParser the right way to go about this? Are there better ways of parsing this file? My main concern is that I want to allow the files to be very large so I can flush a buffer at some point and not keep the entire file in memory.

like image 1000
Jeff Wu Avatar asked Jun 13 '12 15:06

Jeff Wu


2 Answers

If your csv doesn't contain escaped newlines then it is pretty easy to do a progressive parsing without putting the whole file into memory. The iteratee library comes with a method search inside play.api.libs.iteratee.Parsing :

def search (needle: Array[Byte]): Enumeratee[Array[Byte], MatchInfo[Array[Byte]]]

which will partition your stream into Matched[Array[Byte]] and Unmatched[Array[Byte]]

Then you can combine a first iteratee that takes a header and another that will fold into the umatched results. This should look like the following code:

// break at each match and concat unmatches and drop the last received element (the match)
val concatLine: Iteratee[Parsing.MatchInfo[Array[Byte]],String] = 
  ( Enumeratee.breakE[Parsing.MatchInfo[Array[Byte]]](_.isMatch) ><>
    Enumeratee.collect{ case Parsing.Unmatched(bytes) => new String(bytes)} &>>
    Iteratee.consume() ).flatMap(r => Iteratee.head.map(_ => r))

// group chunks using the above iteratee and do simple csv parsing
val csvParser: Iteratee[Array[Byte], List[List[String]]] =
  Parsing.search("\n".getBytes) ><>
  Enumeratee.grouped( concatLine ) ><>
  Enumeratee.map(_.split(',').toList) &>>
  Iteratee.head.flatMap( header => Iteratee.getChunks.map(header.toList ++ _) )

// an example of a chunked simple csv file
val chunkedCsv: Enumerator[Array[Byte]] = Enumerator("""a,b,c
""","1,2,3","""
4,5,6
7,8,""","""9
""") &> Enumeratee.map(_.getBytes)

// get the result
val csvPromise: Promise[List[List[String]]] = chunkedCsv |>>> csvParser

// eventually returns List(List(a, b, c),List(1, 2, 3), List(4, 5, 6), List(7, 8, 9))

Of course you can improve the parsing. If you do, I would appreciate if you share it with the community.

So your Play2 controller would be something like:

val requestCsvBodyParser = BodyParser(rh => csvParser.map(Right(_)))

// progressively parse the big uploaded csv like file
def postCsv = Action(requestCsvBodyParser){ rq: Request[List[List[String]]] => 
  //do something with data
}
like image 96
2 revs Avatar answered Nov 14 '22 21:11

2 revs


If you don't mind holding twice the size of List[List[String]] in memory then you could use a body parser like play.api.mvc.BodyParsers.parse.tolerantText:

def toCsv = Action(parse.tolerantText) { request =>
  val data = request.body
  val reader = new java.io.StringReader(data)
  // use a Java CSV parsing library like http://opencsv.sourceforge.net/
  // to transform the text into CSV data
  Ok("Done")
}

Note that if you want to reduce memory consumption, I recommend using Array[Array[String]] or Vector[Vector[String]] depending on if you want to deal with mutable or immutable data.

If you are dealing with truly large amount of data (or lost of requests of medium size data) and your processing can be done incrementally, then you can look at rolling your own body parser. That body parser would not generate a List[List[String]] but instead parse the lines as they come and fold each line into the incremental result. But this is quite a bit more complex to do, in particular if your CSV is using double quote to support fields with commas, newlines or double quotes.

like image 40
huynhjl Avatar answered Nov 14 '22 23:11

huynhjl