I'm having trouble processing a huge JSON file in Ruby. What I'm looking for is a way to process it entry-by-entry without keeping too much data in memory.
I thought the yajl-ruby gem would do the job, but it consumes all my memory. I've also looked at the Yajl::FFI and JSON::Stream gems, but their documentation clearly states:
For larger documents we can use an IO object to stream it into the parser. We still need room for the parsed object, but the document itself is never fully read into memory.
Here's what I've done with Yajl:
file_stream = File.open(file, "r")
json = Yajl::Parser.parse(file_stream)
json.each do |entry|
  entry.do_something
end
file_stream.close
The memory usage keeps getting higher until the process is killed.
I don't see why Yajl keeps processed entries in memory. Can I somehow free them, or did I just misunderstand the capabilities of the Yajl parser?
If it cannot be done using Yajl: is there a way to do this in Ruby via any library?
json = Yajl::Parser.parse(file_stream)
When you invoke Yajl::Parser like this, the entire stream is loaded into memory to create your data structure. Don't do that.
Yajl provides Parser#parse_chunk, Parser#on_parse_complete, and other related methods that enable you to trigger parsing events on a stream without requiring that the whole IO stream be parsed at once. The README contains an example of how to use chunking instead.
The example given in the README is:
Or let's say you didn't have access to the IO object that contained JSON data, but instead only had access to chunks of it at a time. No problem!
(Assume we're in an EventMachine::Connection instance)
def post_init
  @parser = Yajl::Parser.new(:symbolize_keys => true)
end

def object_parsed(obj)
  puts "Sometimes one pays most for the things one gets for nothing. - Albert Einstein"
  puts obj.inspect
end

def connection_completed
  # once a full JSON object has been parsed from the stream
  # object_parsed will be called, and passed the constructed object
  @parser.on_parse_complete = method(:object_parsed)
end

def receive_data(data)
  # continue passing chunks
  @parser << data
end
Or if you don't need to stream it, it'll just return the built object from the parse when it's done. NOTE: if there are going to be multiple JSON strings in the input, you must specify a block or callback as this is how yajl-ruby will hand you (the caller) each object as it's parsed off the input.
obj = Yajl::Parser.parse(str_or_io)
One way or another, you have to parse only a subset of your JSON data at a time. Otherwise, you are simply instantiating a giant Hash in memory, which is exactly the behavior you describe.
Without knowing what your data looks like and how your JSON objects are composed, it isn't possible to give a more detailed explanation than that; as a result, your mileage may vary. However, this should at least get you pointed in the right direction.
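If you control how the file is produced, one simple way to parse only a subset at a time is to store one JSON entry per line (newline-delimited JSON) and parse each line on its own. This is a different technique from Yajl's chunked parsing, and it only works under that format assumption; it will not help with a single giant top-level array. The sketch below uses only Ruby's standard library, and each_json_line is a hypothetical helper name, not part of any gem:

```ruby
require "json"
require "tempfile"

# Hypothetical helper (not part of Yajl or any gem): yields one parsed entry
# per non-empty line, so only the current entry is held in memory.
# Assumes the file is newline-delimited JSON, one complete entry per line.
def each_json_line(path)
  File.foreach(path) do |line|
    line = line.strip
    next if line.empty? # tolerate blank lines between entries

    yield JSON.parse(line)
  end
end

# Small demonstration with a throwaway file standing in for the huge one.
sample = Tempfile.new(["entries", ".ndjson"])
sample.write(%({"id": 1}\n{"id": 2}\n\n{"id": 3}\n))
sample.close

ids = []
each_json_line(sample.path) { |entry| ids << entry["id"] }
sample.unlink
```

File.foreach reads the file line by line, so each parsed entry becomes eligible for garbage collection as soon as the block returns, as long as you don't keep a reference to it yourself.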
Both @CodeGnome's and @A. Rager's answers helped me understand the solution.
I ended up creating the gem json-streamer that offers a generic approach and spares the need to manually define callbacks for every scenario.