How can I process huge JSON files as streams in Ruby, without consuming all memory?

I'm having trouble processing a huge JSON file in Ruby. What I'm looking for is a way to process it entry-by-entry without keeping too much data in memory.

I thought the yajl-ruby gem would do the job, but it consumes all my memory. I've also looked at the Yajl::FFI and JSON::Stream gems, but their documentation clearly states:

For larger documents we can use an IO object to stream it into the parser. We still need room for the parsed object, but the document itself is never fully read into memory.

Here's what I've done with Yajl:

require 'yajl'

file_stream = File.open(file, "r")
json = Yajl::Parser.parse(file_stream)
json.each do |entry|
  entry.do_something
end
file_stream.close

The memory usage keeps getting higher until the process is killed.

I don't see why Yajl keeps processed entries in memory. Can I somehow free them, or did I just misunderstand the capabilities of the Yajl parser?

If it cannot be done using Yajl, is there a way to do this in Ruby with any other library?

asked Aug 25 '15 by thisismydesign


2 Answers

Problem

json = Yajl::Parser.parse(file_stream)

When you invoke Yajl::Parser like this, the entire stream is loaded into memory to create your data structure. Don't do that.

Solution

Yajl provides Parser#parse_chunk, Parser#on_parse_complete, and other related methods that enable you to trigger parsing events on a stream without requiring that the whole IO stream be parsed at once. The README contains an example of how to use chunking instead.

The example given in the README is:

Or lets say you didn't have access to the IO object that contained JSON data, but instead only had access to chunks of it at a time. No problem!

(Assume we're in an EventMachine::Connection instance)

def post_init
  @parser = Yajl::Parser.new(:symbolize_keys => true)
end

def object_parsed(obj)
  puts "Sometimes one pays most for the things one gets for nothing. - Albert Einstein"
  puts obj.inspect
end

def connection_completed
  # once a full JSON object has been parsed from the stream
  # object_parsed will be called, and passed the constructed object
  @parser.on_parse_complete = method(:object_parsed)
end

def receive_data(data)
  # continue passing chunks
  @parser << data
end

Or if you don't need to stream it, it'll just return the built object from the parse when it's done. NOTE: if there are going to be multiple JSON strings in the input, you must specify a block or callback as this is how yajl-ruby will hand you (the caller) each object as it's parsed off the input.

obj = Yajl::Parser.parse(str_or_io)

One way or another, you have to parse only a subset of your JSON data at a time. Otherwise, you are simply instantiating a giant Hash in memory, which is exactly the behavior you describe.

Without knowing what your data looks like and how your JSON objects are composed, it isn't possible to give a more detailed explanation than that; as a result, your mileage may vary. However, this should at least get you pointed in the right direction.
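Applied to the question's file-based case (no EventMachine), a minimal sketch of the same chunking approach might look like the following. The file name, chunk size, and handle_entry method are illustrative placeholders, and it assumes the input is a stream of top-level JSON documents; if the file is one enormous array or object, on_parse_complete will still fire only once with the whole parsed structure.

require 'yajl'

# handle_entry is a placeholder for whatever per-entry work you need to do
def handle_entry(obj)
  puts obj.inspect
end

parser = Yajl::Parser.new(symbolize_keys: true)
parser.on_parse_complete = method(:handle_entry)

File.open("huge.json", "r") do |io|
  # feed the parser fixed-size chunks instead of the whole file
  while (chunk = io.read(8192))
    parser << chunk
  end
end

The 8 KB chunk size is arbitrary; only one chunk plus the object currently being built needs to fit in memory at a time.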

answered Sep 25 '22 by Todd A. Jacobs

Both @CodeGnome's and @A. Rager's answers helped me understand the solution.

I ended up creating the gem json-streamer that offers a generic approach and spares the need to manually define callbacks for every scenario.
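For reference, a minimal usage sketch based on the gem's README at the time (the file name is a placeholder, and option names such as chunk_size and nesting_level may differ in later versions):

require 'json/streamer'

file_stream = File.open("huge.json", "r")

streamer = Json::Streamer.parser(file_io: file_stream, chunk_size: 500)

# yields each object found one level below the JSON root, so a top-level
# array of entries is processed entry-by-entry without building the whole array
streamer.get(nesting_level: 1) do |entry|
  # handle one entry at a time (placeholder)
  puts entry.inspect
end

file_stream.close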

answered Sep 25 '22 by thisismydesign