I am running a cron job on my Amazon EC2 micro instance every 12 hours. It downloads a 118 MB file and parses it using the json library. This, of course, makes the instance run out of memory. My instance has 416 MB of memory free, but when I run the script it drops down to 6 MB and then the script is killed by the OS.
What are my options here? Is it possible to parse this efficiently in Ruby, or do I have to drop down to low-level stuff like C? I can get a more capable Amazon instance, but I really want to know if it's possible to do this in Ruby.
UPDATE: I have looked at yajl. It can give you JSON objects as it parses, but the problem is that if your JSON file contains only one root object, it is forced to parse the whole file. My JSON looks like this:
--Root
-Obj 1
-Obj 2
-Obj 3
So if I do:
parser.parse(file) do |hash|
  # do something here
end
Since I only have one root object, it will parse the entire JSON. If Obj 1/2/3 were root objects, it would work, because yajl would hand them to me one by one. My JSON isn't like that, though, so it parses everything and eats up 500 MB of memory...
UPDATE # 2: Here's a smaller version of the large 118mb file (7mb):
GONE
It's parseable; I didn't just chop some bytes off the file, so you can see it as a whole. The array I am looking for is this:
events = json['resultsPage']['results']['event']
Thanks
YAJL implements a streaming parser. You can use it to read your JSON on-the-fly, so you can operate on the contents as they come in, then discard them (and the generated data structures from them) after you're done with them. If you're clever about it, this'll keep you under your memory limits.
Edit: With your data, you are really interested in pulling out portions of the JSON object at a time, rather than parsing the whole object. This is significantly trickier, and really requires that you implement your own parser. The nuts and bolts of it are that you want to:
This won't work with yajl, since you are dealing with one object here, rather than multiple objects. To make it work with yajl, you're going to need to parse the JSON manually to discover the event object boundaries, then pass each event object chunk to a JSON parser for deserialization. Something like Ragel could simplify this process for you.
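To make the boundary-scanning idea concrete, here is a minimal sketch using only Ruby's standard library. The helper name `each_array_object` is made up for illustration; it is not part of yajl or the json gem. It assumes the first occurrence of the quoted key (here `"event"`) in the file is the one you want, that its value is an array, and that the array contains only objects. It reads one character at a time, which is slow but keeps only one event in memory at once.

```ruby
require 'json'
require 'stringio'

# Hypothetical helper: yields each object of the array stored under `key`,
# one at a time, without building the whole document in memory.
# Assumes the first "key" in the stream is followed by an array of objects.
def each_array_object(io, key)
  return enum_for(:each_array_object, io, key) unless block_given?

  marker = "\"#{key}\""
  state = :seek_key   # :seek_key -> :seek_array -> :in_array
  window = +''        # sliding window used to spot the quoted key
  buf = +''           # raw text of the object currently being collected
  depth = 0           # brace nesting inside the current object
  in_string = false
  escaped = false

  io.each_char do |ch|
    case state
    when :seek_key
      window << ch
      window.slice!(0) if window.size > marker.size
      state = :seek_array if window == marker
    when :seek_array
      state = :in_array if ch == '['
    when :in_array
      if depth.zero?
        return if ch == ']'      # end of the array: stop reading
        next unless ch == '{'    # skip commas and whitespace between objects
        depth = 1
        buf = +'{'
      else
        buf << ch
        if in_string
          # Braces inside strings must not affect the depth counter.
          if escaped then escaped = false
          elsif ch == '\\' then escaped = true
          elsif ch == '"' then in_string = false
          end
        elsif ch == '"' then in_string = true
        elsif ch == '{' then depth += 1
        elsif ch == '}'
          depth -= 1
          if depth.zero?
            yield JSON.parse(buf) # deserialize just this one object
            buf = +''
          end
        end
      end
    end
  end
end

# Usage with a real file (the path is a placeholder):
#   File.open('big.json') do |f|
#     each_array_object(f, 'event') { |event| puts event['id'] }
#   end
```

Each yielded hash can be processed and then dropped, so peak memory is roughly the size of the largest single event rather than the whole file. A byte-level scanner or a Ragel-generated one would be faster, but the structure would be the same.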
Of course, it would be easier to just upgrade your AWS instance.