 

Efficient way to parse 100 MB of JSON payload

I am running a cron job on my Amazon EC2 micro instance every 12 hours. It downloads a 118 MB file and parses it using the json library. This of course makes the instance run out of memory. My instance has 416 MB of memory free, but when I run the script that drops to 6 MB and then the script is killed by the OS.

I am wondering what my options are here. Is it possible to parse this efficiently in Ruby, or do I have to drop down to low-level stuff like C? I can get a more capable Amazon instance, but I really want to know if it's possible to do this in Ruby.

UPDATE: I have looked at yajl. It can yield JSON objects as it parses, but the problem is that if your JSON file contains only one root object, it is forced to parse the entire file. My JSON looks like this:

--Root
   -Obj 1
   -Obj 2
   -Obj 3

So if I do:

parser.parse(file) do |hash|
  #do something here
end

Since I only have one root object, it parses the entire JSON. If Obj 1/2/3 were roots it would work, since yajl would hand them to me one by one, but my JSON isn't like that, so parsing it eats up 500 MB of memory...

UPDATE #2: Here's a smaller (7 MB) version of the large 118 MB file:

GONE

It's parseable; I didn't just chop some bytes off the file, so you can see it as a whole. The array I am looking for is this:

events = json['resultsPage']['results']['event']
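For reference, here is a minimal skeleton of that lookup with Ruby's standard json library. Only the resultsPage/results/event keys come from the question; the "id" field and sample values are made up:

```ruby
require 'json'

# Hypothetical skeleton of the payload; only the three nested keys are
# from the question, the rest is invented for illustration.
raw = '{"resultsPage":{"results":{"event":[{"id":1},{"id":2}]}}}'

json = JSON.parse(raw)
events = json['resultsPage']['results']['event']
events.length # => 2
```

With the real 118 MB file, JSON.parse builds the entire object tree in memory before this lookup ever runs, which is exactly what exhausts the instance.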

Thanks

0xSina asked Dec 21 '12


1 Answer

YAJL implements a streaming parser. You can use it to read your JSON on-the-fly, so you can operate on the contents as they come in, then discard them (and the generated data structures from them) after you're done with them. If you're clever about it, this'll keep you under your memory limits.

Edit: With your data, you are really interested in pulling out portions of the JSON object at a time, rather than parsing the whole object. This is significantly trickier, and really requires that you implement your own parser. The nuts and bolts of it are that you want to:

  1. Step into the events array
  2. For each event in the array, parse the event
  3. Pass the parsed event into some callback function
  4. Discard the parsed event and source input to free memory for the next event.

This won't work with yajl, since you are dealing with one object here, rather than multiple objects. To make it work with yajl, you're going to need to parse the JSON manually to discover the event object boundaries, then pass each event object chunk to a JSON parser for deserialization. Something like Ragel could simplify this process for you.
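A minimal sketch of that manual approach, using only the standard json library. The `each_event` helper and the sample string are made up for illustration; a real version would stream the file in chunks rather than hold the whole string, and would tolerate whitespace around the "event" key:

```ruby
require 'json'

# Hypothetical sketch: locate the "event" array in the raw text, then walk it
# element by element, tracking brace depth and in-string state so each complete
# object can be handed to JSON.parse on its own.
def each_event(json_str)
  start = json_str.index('"event"')
  raise ArgumentError, 'no "event" array found' unless start
  i = json_str.index('[', start)
  raise ArgumentError, 'no "event" array found' unless i
  i += 1
  depth = 0
  in_string = false
  escaped = false
  obj_start = nil
  while i < json_str.length
    c = json_str[i]
    if in_string
      if escaped      then escaped = false
      elsif c == '\\' then escaped = true
      elsif c == '"'  then in_string = false
      end
    else
      case c
      when '"' then in_string = true
      when '{'
        obj_start = i if depth.zero?
        depth += 1
      when '}'
        depth -= 1
        yield JSON.parse(json_str[obj_start..i]) if depth.zero?
      when ']'
        break if depth.zero? # closing bracket of the event array itself
      end
    end
    i += 1
  end
end

# With the real file you would feed chunks through the scanner; a small
# string stands in here:
sample = '{"resultsPage":{"results":{"event":[{"id":1},{"id":2}]}}}'
ids = []
each_event(sample) { |e| ids << e['id'] }
ids # => [1, 2]
```

Each event is parsed and discarded independently, so peak memory stays proportional to the largest single event rather than the whole file.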

Of course, it would be easier to just upgrade your AWS instance.

Chris Heald answered Nov 01 '22