I have a JSON file containing quite a lot of test data, which I want to parse and push through an algorithm I'm testing. It's about 30MB in size, with a list of 60,000 or so elements. I initially tried the simple parser in scala.util.parsing.json, like so:
import scala.io.Source
import scala.util.parsing.json.JSON
val data = JSON.parseFull(Source.fromFile(path).mkString)
Where path is just a string containing the path to the big JSON file. That chugged away for about 45 minutes, then threw this:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Someone then pointed out to me that nobody uses this library and I should use Lift's JSON parser. So I tried this in my Scala REPL:
scala> import scala.io.Source
import scala.io.Source
scala> val s = Source.fromFile("path/to/big.json")
s: scala.io.BufferedSource = non-empty iterator
scala> import net.liftweb.json._
import net.liftweb.json._
scala> val data = parse(s mkString)
java.lang.OutOfMemoryError: GC overhead limit exceeded
This time it only took about 3 minutes, but it hit the same error.
So, obviously I could break the file up into smaller ones, iterate over the directory of JSON files and merge my data together piece-by-piece, but I'd rather avoid it if possible. Does anyone have any recommendations?
For further information: I'd been working with this same dataset for the past few weeks in Clojure (for visualization with Incanter) without issues. The following works perfectly fine:
user=> (use 'clojure.data.json)
nil
user=> (use 'clojure.java.io)
nil
user=> (time (def data (read-json (reader "path/to/big.json"))))
"Elapsed time: 19401.629685 msecs"
#'user/data
That error indicates that the application is spending more than 98% of its time collecting garbage (and recovering less than 2% of the heap in the process).
I'd suspect that Scala is generating a lot of short-lived objects, which is what is causing the excessive garbage collection. You can verify the GC behavior by adding the -verbose:gc command-line switch to java.
The default maximum heap size on a Java 1.5+ server VM is 1 GB (or 1/4 of installed memory, whichever is less), which should be sufficient for your purposes, but you may want to increase the new generation to see if that improves your performance. On the Oracle VM, this is done with the -Xmn option. Try setting the following environment variable:
export JAVA_OPTS="-server -Xmx1024m -Xms1024m -Xmn256m -verbose:gc -XX:+PrintGCDetails"
and re-running your application.
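If you're working in the Scala REPL, the scala launcher script should pick up JAVA_OPTS; alternatively (a rough sketch, assuming a standard scala launcher), you can pass the same flags straight to the underlying JVM with -J prefixes:
scala -J-server -J-Xmx1024m -J-Xms1024m -J-verbose:gc -J-XX:+PrintGCDetails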
You should also check out the JVM garbage collection tuning guide for more details.
Try using Jerkson instead. Jerkson uses Jackson underneath, which repeatedly scores as the fastest and most memory-efficient JSON parser on the JVM.
I've used both Lift JSON and Jerkson in production, and Jerkson's performance was significantly better than Lift's (especially when parsing and generating large JSON documents).
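For reference, here's a rough sketch of what that could look like with Jerkson (assuming the Jerkson dependency is on your classpath and the file is a single JSON array; the object name and path are just placeholders):
import com.codahale.jerkson.Json.parse
import scala.io.Source

object ParseBigJson {
  def main(args: Array[String]): Unit = {
    // Read the file into a single string (30MB is fine) and let
    // Jerkson/Jackson deserialize it into ordinary Scala collections.
    val raw = Source.fromFile("path/to/big.json").mkString
    val data = parse[List[Map[String, Any]]](raw)
    println("Parsed " + data.size + " elements")
  }
}
If holding the whole parsed structure in memory is still a problem, Jerkson also has (if I recall correctly) a streaming parse, Json.stream, that iterates over the elements of a JSON array one at a time.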