What's the proper way to parse a very large JSON file in Ruby?

Tags: json, ruby

How can we parse a JSON file in Ruby?

require 'json'

JSON.parse File.read('data.json')

What if the file is very large and we don't want to load it into memory at once? How would we parse it then?

asked Dec 24 '22 by Alexander Popov

1 Answer

Since you said you don't want to load the file into memory at once, parsing it in chunks may be more suitable for you. You can use the yajl-ffi gem to achieve this. From its documentation:

For larger documents, we can use an IO object to stream it into the parser. We still need room for the parsed object, but the document itself is never fully read into memory.

require 'yajl/ffi'
stream = File.open('/tmp/test.json')
obj = Yajl::FFI::Parser.parse(stream)

However, when streaming small documents from disk, or over the network, the yajl-ruby gem will give us the best performance.
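
For comparison, here is a minimal sketch (mine, not from the original answer) of the yajl-ruby approach: Yajl::Parser.parse also accepts an IO object, so the file is read in chunks, though the parsed result is still built fully in memory.

require 'yajl'

# Parse directly from an IO; yajl-ruby reads the file in chunks,
# but the resulting Ruby object is held fully in memory.
obj = File.open('/tmp/test.json') { |io| Yajl::Parser.parse(io) }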

Huge documents arriving over the network in small chunks to an EventMachine receive_data loop is where Yajl::FFI is uniquely suited. Inside an EventMachine::Connection subclass we might have:

def post_init
  # Register a callback for each parser event; these fire as data streams in.
  @parser = Yajl::FFI::Parser.new
  @parser.start_document { puts "start document" }
  @parser.end_document   { puts "end document" }
  @parser.start_object   { puts "start object" }
  @parser.end_object     { puts "end object" }
  @parser.start_array    { puts "start array" }
  @parser.end_array      { puts "end array" }
  @parser.key            { |k| puts "key: #{k}" }
  @parser.value          { |v| puts "value: #{v}" }
end

def receive_data(data)
  # Feed each incoming network chunk to the parser; drop the connection on malformed JSON.
  begin
    @parser << data
  rescue Yajl::FFI::ParserError => e
    close_connection
  end
end

The parser accepts chunks of the JSON document and parses up to the end of the available buffer. Passing in more data resumes the parse from the prior state. When an interesting state change happens, the parser notifies all registered callback procs of the event.
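
To make that resumable behaviour concrete, here is a tiny sketch (mine, not from the yajl-ffi docs) that feeds a document in two pieces:

require 'yajl/ffi'

parser = Yajl::FFI::Parser.new
parser.value { |v| puts "value: #{v}" }

parser << '{"a": 1,'   # parses up to the end of the buffer, printing "value: 1"
parser << ' "b": 2}'   # resumes from the saved state, printing "value: 2"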

The event callback is where we can do interesting data filtering and passing to other processes. The above example simply prints state changes, but the callbacks might look for an array named rows and process sets of these row objects in small batches. Millions of rows, streaming over the network, can be processed in constant memory space this way.
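
Here is a sketch of that idea (not part of the original answer; process_batch and BATCH_SIZE are hypothetical names, and it assumes the row objects are flat, with no nested arrays or objects):

require 'yajl/ffi'

BATCH_SIZE = 1000 # hypothetical tuning knob

def process_batch(rows)
  # Hypothetical downstream handler; here we just report the batch size.
  puts "processing #{rows.size} rows"
end

parser    = Yajl::FFI::Parser.new
batch     = []
current   = nil   # row object currently being assembled
last_key  = nil   # most recent key seen inside that object
in_rows   = false # true while we are inside the "rows" array
rows_next = false # true when the next start_array belongs to "rows"

parser.key do |k|
  if current
    last_key = k
  elsif k == 'rows'
    rows_next = true
  end
end

parser.start_array do
  in_rows   = true if rows_next
  rows_next = false
end

parser.end_array do
  if in_rows
    process_batch(batch) unless batch.empty? # flush the final partial batch
    batch.clear
    in_rows = false
  end
end

parser.start_object do
  current = {} if in_rows
end

parser.value do |v|
  current[last_key] = v if current
end

parser.end_object do
  next unless current
  batch << current
  current = nil
  if batch.size >= BATCH_SIZE
    process_batch(batch)
    batch.clear
  end
end

# Stream the file through the parser in fixed-size chunks.
File.open('/tmp/test.json') do |io|
  parser << io.read(16 * 1024) until io.eof?
end

Because each batch is handed off and cleared as soon as it fills up, memory use stays proportional to BATCH_SIZE rather than to the size of the document.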

answered Feb 23 '23 by mdegis