Stream based parsing and writing of JSON

Tags:

json

io

memory

parsing

ruby

I fetch about 20,000 datasets from a server in batches of 1,000. Each dataset is a JSON object. Persisted, this makes around 350 MB of uncompressed plaintext.

I have a memory limit of 1 GB. Hence, I write each batch of 1,000 JSON objects as an array into a raw JSON file in append mode.

The result is a file with 20 JSON arrays which need to be aggregated. I need to touch them anyway, because I want to add metadata. Generally the Ruby Yajl parser makes this possible like so:

require 'yajl'

raw_file = File.new(path_to_raw_file, 'r')
json_file = File.new(path_to_json_file, 'w')

datasets = []
parser = Yajl::Parser.new
# Collect every parsed array into one big in-memory array
parser.on_parse_complete = Proc.new { |o| datasets += o }

parser.parse(raw_file)

hash = { date: Time.now, datasets: datasets }
Yajl::Encoder.encode(hash, json_file)

Where is the problem with this solution? The problem is that the whole JSON is still parsed into memory, which I must avoid.

Basically what I need is a solution which parses the JSON from one IO object and encodes it to another IO object at the same time.

I assumed Yajl offers this, but I haven't found a way, nor did its API give any hints, so I guess not. Is there a JSON Parser library which supports this? Are there other solutions?

The only solution I can think of is to use IO#seek: write all the dataset arrays one after another, [...][...][...], and after every array seek back and overwrite ][ with a comma, effectively connecting the arrays manually.

asked Sep 19 '13 18:09

Guarana Joe

1 Answer

Why can't you retrieve a single record at a time from the database, process it as necessary, convert it to JSON, then emit it with a trailing/delimiting comma?

If you started with a file that only contained [, then appended all your JSON strings, then, on the final entry didn't append a comma, and instead used a closing ], you'd have a JSON array of hashes, and would only have to process one row's worth at a time.

It'd be a tiny bit slower (maybe) but wouldn't impact your system. And DB I/O can be very fast if you use blocking/paging to retrieve a reasonable number of records at a time.

For instance, here's a combination of some Sequel example code, and code to extract the rows as JSON and build a larger JSON structure:

require 'json'
require 'sequel'

DB = Sequel.sqlite # memory database

DB.create_table :items do
  primary_key :id
  String :name
  Float :price
end

items = DB[:items] # Create a dataset

# Populate the table
items.insert(:name => 'abc', :price => rand * 100)
items.insert(:name => 'def', :price => rand * 100)
items.insert(:name => 'ghi', :price => rand * 100)

add_comma = false

puts '['
items.order(:price).each do |item|
  puts ',' if add_comma
  add_comma ||= true
  print JSON[item]
end
puts "\n]"

Which outputs:

[
{"id":2,"name":"def","price":3.714714089426208},
{"id":3,"name":"ghi","price":27.0179624376119},
{"id":1,"name":"abc","price":52.51248221170203}
]

Notice the order is now by "price".
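The same bracket-and-comma pattern works against any IO object, not just STDOUT, and it can wrap the rows in the metadata hash from the question. What follows is only a minimal sketch using the standard library: the method name stream_json_envelope and the sample rows are made up for illustration, and a StringIO stands in for the output file.

```ruby
require 'json'
require 'stringio'
require 'time'

# Hypothetical helper: writes {"date": ..., "datasets": [...]} to `io`
# one row at a time, so the full array never has to exist in memory.
# `rows` is anything that responds to #each and yields hashes.
def stream_json_envelope(io, rows, date: Time.now)
  io << '{"date":' << date.iso8601.to_json << ',"datasets":['
  first = true
  rows.each do |row|
    io << ',' unless first   # comma goes before every row except the first
    first = false
    io << JSON.generate(row)
  end
  io << ']}'
  io
end

out = StringIO.new           # stand-in for File.open(path, 'w')
rows = [{ name: 'abc', price: 1.5 }, { name: 'def', price: 2.5 }]
stream_json_envelope(out, rows, date: Time.utc(2013, 9, 19))
puts out.string
```

Writing the comma before each row instead of after it avoids having to know which row is the last one.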

Validation is easy:

require 'json'
require 'pp'

pp JSON[<<EOT]
[
{"id":2,"name":"def","price":3.714714089426208},
{"id":3,"name":"ghi","price":27.0179624376119},
{"id":1,"name":"abc","price":52.51248221170203}
]
EOT

Which results in:

[{"id"=>2, "name"=>"def", "price"=>3.714714089426208},
 {"id"=>3, "name"=>"ghi", "price"=>27.0179624376119},
 {"id"=>1, "name"=>"abc", "price"=>52.51248221170203}]

This validates the JSON and demonstrates that the original data is recoverable. Each row retrieved from the database should be a minimal "bitesized" piece of the overall JSON structure you want to build.

Building upon that, here's how to read JSON stored in the database, manipulate it, then emit it as a JSON file:

require 'json'
require 'sequel'

DB = Sequel.sqlite # memory database

DB.create_table :items do
  primary_key :id
  String :json
end

items = DB[:items] # Create a dataset

# Populate the table
items.insert(:json => JSON[:name => 'abc', :price => rand * 100])
items.insert(:json => JSON[:name => 'def', :price => rand * 100])
items.insert(:json => JSON[:name => 'ghi', :price => rand * 100])
items.insert(:json => JSON[:name => 'jkl', :price => rand * 100])
items.insert(:json => JSON[:name => 'mno', :price => rand * 100])
items.insert(:json => JSON[:name => 'pqr', :price => rand * 100])
items.insert(:json => JSON[:name => 'stu', :price => rand * 100])
items.insert(:json => JSON[:name => 'vwx', :price => rand * 100])
items.insert(:json => JSON[:name => 'yz_', :price => rand * 100])

add_comma = false

puts '['
items.each do |item|
  puts ',' if add_comma
  add_comma ||= true
  print JSON[
    JSON[
      item[:json]
    ].merge('foo' => 'bar', 'time' => Time.now.to_f)
  ]
end
puts "\n]"

Which generates:

[
{"name":"abc","price":3.268814929005337,"foo":"bar","time":1379688093.124606},
{"name":"def","price":13.871147312377719,"foo":"bar","time":1379688093.124664},
{"name":"ghi","price":52.720984131655676,"foo":"bar","time":1379688093.124702},
{"name":"jkl","price":53.21477190840114,"foo":"bar","time":1379688093.124732},
{"name":"mno","price":40.99364022416619,"foo":"bar","time":1379688093.124758},
{"name":"pqr","price":5.918738444452265,"foo":"bar","time":1379688093.124803},
{"name":"stu","price":45.09391752439902,"foo":"bar","time":1379688093.124831},
{"name":"vwx","price":63.08947792357426,"foo":"bar","time":1379688093.124862},
{"name":"yz_","price":94.04921035056373,"foo":"bar","time":1379688093.124894}
]

I added the timestamp so you can see that each row is processed individually, and to give you an idea how fast the rows are being processed. Granted, this is a tiny, in-memory database, which has no network I/O to contend with, but a normal network connection through a switch to a database on a reasonable DB host should be pretty fast too. Telling the ORM to read the DB in chunks can speed up the processing because the DBM will be able to return larger blocks to more efficiently fill the packets. You'll have to experiment to determine what size chunks you need because it will vary based on your network, your hosts, and the size of your records.
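To make the chunking idea concrete without tying it to a particular ORM, here is a rough sketch in plain Ruby. Everything in it is an assumption for illustration: TABLE is fake data, fetch_page stands in for whatever paged query your database layer offers, and rows_in_pages is a made-up name. The point is only that the consumer sees one row at a time while the fetches happen a page at a time.

```ruby
require 'json'

# Fake table standing in for a real database (illustration only)
TABLE = (1..10).map { |i| { id: i, name: "item#{i}" } }

# Hypothetical stand-in for a paged query (LIMIT/OFFSET or similar)
def fetch_page(offset, limit)
  TABLE[offset, limit] || []
end

# Returns an Enumerator that fetches page by page but yields row by row,
# so at most one page of rows is held in memory at any time.
def rows_in_pages(page_size)
  Enumerator.new do |yielder|
    offset = 0
    loop do
      page = fetch_page(offset, page_size)
      break if page.empty?
      page.each { |row| yielder << row }
      offset += page.size
    end
  end
end

rows_in_pages(4).each { |row| puts JSON.generate(row) }
```

The page size is the knob to experiment with; the row-handling code does not change when you tune it.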

Your original design isn't good when dealing with enterprise-sized databases, especially when your hardware resources are limited. Over the years we've learned how to parse BIG databases, which make 20,000-row tables appear minuscule. VM slices are common these days and we use them for crunching, so they're often the PCs of yesteryear: single CPU with small memory footprints and dinky drives. We can't beat them up or they'll be bottlenecks, so we have to break the data into the smallest atomic pieces we can.
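Breaking the work into small pieces applies to the raw file from the question as well. The sketch below assumes, and this is an assumption rather than something the question states, that each batch array was appended on its own line, so the file can be read line by line with only one batch in memory at a time. The names each_dataset and merge_batches are hypothetical, and a Tempfile and StringIO stand in for the real files.

```ruby
require 'json'
require 'stringio'
require 'tempfile'

# Yields every dataset in the raw file, parsing one batch (one line) at a time.
# Assumes one complete JSON array per line.
def each_dataset(raw_path)
  File.foreach(raw_path) do |line|
    next if line.strip.empty?
    JSON.parse(line).each { |dataset| yield dataset }
  end
end

# Merges all batches into a single JSON array written to out_io
def merge_batches(raw_path, out_io)
  first = true
  out_io << '['
  each_dataset(raw_path) do |dataset|
    out_io << ",\n" unless first
    first = false
    out_io << JSON.generate(dataset)
  end
  out_io << ']'
end

# Demo with two small batches
raw = Tempfile.new('raw')
raw.puts JSON.generate([{ 'id' => 1 }, { 'id' => 2 }])
raw.puts JSON.generate([{ 'id' => 3 }])
raw.close                      # flush to disk; the file stays until unlink

out = StringIO.new             # stand-in for the final JSON file
merge_batches(raw.path, out)
puts out.string
raw.unlink
```

If the batches were written back to back without newlines, this does not apply as is, and a parser that can handle concatenated documents would be needed instead.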

Harping about DB design: Storing JSON in a database is a questionable practice. DBMs these days can spew JSON, YAML and XML representations of rows, but forcing the DBM to search inside stored JSON, YAML or XML strings is a major hit in processing speed, so avoid it at all costs unless you also have the equivalent lookup data indexed in separate fields so your searches are at the highest possible speed. If the data is available in separate fields, then doing good ol' database queries, tweaking in the DBM or your scripting language of choice, and emitting the massaged data becomes a lot easier.

answered Sep 22 '22 01:09

the Tin Man
