
Scanning a HUGE JSON file for deserializable data in Scala

I need to be able to process large JSON files, instantiating objects from deserializable substrings as the file is iterated over or streamed in.

For example:

Let's say I can only deserialize into instances of the following:

case class Data(a: Int, b: Int, c: Int)

and the expected JSON format is:

{   "foo": [ {"a": 0, "b": 0, "c": 0 }, {"a": 0, "b": 0, "c": 1 } ], 
    "bar": [ {"a": 1, "b": 0, "c": 0 }, {"a": 1, "b": 0, "c": 1 } ], 
     .... MANY ITEMS .... , 
    "qux": [ {"a": 0, "b": 0, "c": 0 }  }

What I would like to do is:

import com.codahale.jerkson.Json
val dataSeq: Seq[Data] = Json.advanceToValue("foo").stream[Data](fileStream)
// NOTE: this will not compile since I pulled the "advanceToValue" out of thin air.

As a final note, I would prefer a solution that involves Jerkson or another library that comes with the Play framework, but if a different Scala library handles this scenario with greater ease and decent performance, I'm not opposed to trying it. If there is a clean way of manually seeking through the file and then using a JSON library to continue parsing from there, I'm fine with that.

What I do not want to do is ingest the entire file without streaming or an iterator, as keeping the whole file in memory at once would be prohibitively expensive.

asked Jan 16 '13 by Ryan Delucchi


1 Answer

I have not done this with JSON (and I hope someone will come up with a turnkey solution for you), but I have done it with XML, and here is a way of handling it.

It is basically a simple map/reduce process with the help of a streaming parser.

Map (your advanceToValue)

Use a streaming parser, like JSON Simple (not tested). When a callback matches your "path", collect everything below it by writing it to a stream (file-backed or in-memory, depending on your data). That will be the foo array in your example. If your mapper is sophisticated enough, you may want to collect multiple paths during the map step. A sketch of this step follows below.
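For instance, here is a minimal sketch of the map step, assuming Jackson 2.x's streaming JsonParser (the parser Jerkson is built on); the file name "huge.json" is a placeholder, the field "foo" comes from the question, and advanceToValue is the hypothetical helper the question asked for, not an existing API:

import com.fasterxml.jackson.core.{JsonFactory, JsonParser, JsonToken}
import java.io.File

// Position the parser on the value of the given top-level field,
// skipping every other subtree without materializing it in memory.
// Returns true if the field was found.
def advanceToValue(parser: JsonParser, field: String): Boolean = {
  if (parser.nextToken() != JsonToken.START_OBJECT) return false
  while (parser.nextToken() != JsonToken.END_OBJECT) {
    val name = parser.getCurrentName
    parser.nextToken()              // step from the field name onto its value
    if (name == field) return true  // parser now sits on START_ARRAY, START_OBJECT or a scalar
    parser.skipChildren()           // no-op for scalars; skips whole arrays/objects otherwise
  }
  false
}

val parser = new JsonFactory().createParser(new File("huge.json"))
val found = advanceToValue(parser, "foo")

Because skipChildren never builds a tree for the subtrees it passes over, memory stays flat no matter how many items sit between the start of the file and the target field.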

Reduce (your stream[Data])

Since the streams you collected above should be fairly small, you probably do not need to map/split them again; you can parse them directly in memory as JSON objects/arrays and manipulate them from there (transform, recombine, etc.).
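Continuing the sketch above: once the parser sits on the foo array, the reduce step can deserialize one element at a time, so only the current element is ever held in memory. Mapping the tree nodes onto Data by hand avoids needing a Scala-aware ObjectMapper; again, this is an untested illustration, not the question's Jerkson API:

import com.fasterxml.jackson.core.{JsonParser, JsonToken}
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

case class Data(a: Int, b: Int, c: Int)

val mapper = new ObjectMapper()

// Lazily turn the array the parser is positioned on into an Iterator[Data];
// each element is read as a small tree and converted by hand.
def stream(parser: JsonParser): Iterator[Data] =
  Iterator
    .continually(parser.nextToken())
    .takeWhile(_ != JsonToken.END_ARRAY)
    .map { _ =>
      val node: JsonNode = mapper.readTree(parser)
      Data(node.get("a").asInt, node.get("b").asInt, node.get("c").asInt)
    }

if (found) stream(parser).foreach(println)  // 'found' and 'parser' come from the map-step sketch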

answered Sep 30 '22 by Bruno Grieder