
Writing a Large JSON Array To File

I am working on a process that will likely end up serializing very large JSON arrays to file, so loading the entire array into memory and dumping it to file won't work. I need to stream the individual items to file to avoid out-of-memory issues.

Surprisingly, I can't find any examples of doing this. The code snippet below is something I've cobbled together. Is there a better way?

import json

first_item = True
with open('big_json_array.json', 'w') as out:
    out.write('[')
    for item in some_very_big_iterator:
        # Write a comma separator before every item except the first.
        if first_item:
            out.write(json.dumps(item))
            first_item = False
        else:
            out.write("," + json.dumps(item))
    out.write("]")
asked Sep 02 '18 by user163757

1 Answer

While your code works, it can be improved upon. You have two reasonable options, plus an additional suggestion.

Your options are:

  • Don't generate an array at all; generate JSON Lines output instead. For each item in your iterator, this writes a single valid JSON document without embedded newlines to the file, followed by a newline. It is easy to produce with the default json.dump(item, out) configuration followed by out.write('\n'). You end up with a file with a separate JSON document on each line (a short sketch follows after the suggestion below).

    The advantage is that you don't have to worry about memory issues when writing or when reading the data back. With a single large array, you'd hit bigger problems when reading the file later on, because the json module can't be made to load data iteratively, not without manually skipping the initial [ and the commas.

    Reading JSON Lines is simple; see Loading and parsing a JSON file with multiple JSON objects in Python.

  • Wrap your data in a generator and a list subclass to make json.dump() accept it as a list, then write the data iteratively. I'll outline this below. Bear in mind that you may then have to solve the problem in reverse, reading the JSON data back with Python.

My suggestion is to not use JSON here. JSON was never designed for large-scale data sets; it's a web interchange format aimed at much smaller payloads. There are better formats for this sort of data exchange. If you can't deviate from JSON, then at least use JSON Lines.
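For reference, here is a minimal sketch of the JSON Lines approach, reusing the some_very_big_iterator name from the question; the big_data.jsonl filename is just a placeholder. The default json.dump() settings never emit newlines, so each document stays on its own line, and the reader only ever holds one line in memory:

import json

# Writing: one JSON document per line (JSON Lines).
with open('big_data.jsonl', 'w') as out:
    for item in some_very_big_iterator:
        json.dump(item, out)    # default settings: no embedded newlines
        out.write('\n')

# Reading: parse one line (one document) at a time.
with open('big_data.jsonl') as infile:
    for line in infile:
        item = json.loads(line)
        # process item here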

You can write a JSON array iteratively by using a generator and a list subclass to fool the json library into accepting it as a list:

import json

class IteratorAsList(list):
    def __init__(self, it):
        self.it = it
    def __iter__(self):
        # json.dump() iterates over the "list" to encode each item.
        return self.it
    def __len__(self):
        # Report a non-zero length so the encoder doesn't shortcut to "[]".
        return 1

with open('big_json_array.json', 'w') as out:
    json.dump(IteratorAsList(some_very_big_iterator), out)

The IteratorAsList class satisfies two tests that the json encoder makes: that the object is a list or subclass thereof, and that it has a length greater than 0; when those conditions are met it'll iterate over the list (using __iter__) and encode each object.

The json.dump() function then writes to the file as the encoder yields data chunks; it'll never hold all the output in memory.
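If you want to see that chunking explicitly, a rough sketch of the same behaviour uses the encoder's iterencode() method directly (this is what json.dump() calls internally in CPython) and writes each chunk as it is produced:

import json

with open('big_json_array.json', 'w') as out:
    encoder = json.JSONEncoder()
    # iterencode() yields small string chunks ('[', each encoded item,
    # separators, ']') one at a time, so the full document never exists
    # in memory.
    for chunk in encoder.iterencode(IteratorAsList(some_very_big_iterator)):
        out.write(chunk)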

answered Oct 19 '22 by Martijn Pieters