
Is there a memory efficient and fast way to load big JSON files?

I have some JSON files of 500 MB each. If I use the "trivial" json.load() to load their content all at once, it will consume a lot of memory.

Is there a way to read the file partially? If it were a text, line-delimited file, I would be able to iterate over the lines. I am looking for an analogy to that.

asked Mar 08 '10 by duduklein


2 Answers

There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.

Update:

I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:

    import ijson

    for prefix, the_type, value in ijson.parse(open(json_file_name)):
        print(prefix, the_type, value)

where prefix is a dot-separated index into the JSON tree (what happens if your key names contain dots? I guess that would be bad for JavaScript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object, or None if the_type is an event like the start/end of a map or array.
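
To make the prefix/event scheme concrete, here is a hedged sketch of the events you would see for a tiny document (the exact value types can vary by ijson backend; numbers may arrive as Decimal):

    import io
    import ijson

    # Events emitted for {"a": [1, 2]} -- array elements show up under the
    # prefix "a.item", which is how ijson addresses the items of an array.
    for prefix, the_type, value in ijson.parse(io.BytesIO(b'{"a": [1, 2]}')):
        print(prefix, the_type, value)
    # ''       start_map    None
    # ''       map_key      a
    # 'a'      start_array  None
    # 'a.item' number       1    (possibly Decimal('1'), depending on backend)
    # 'a.item' number       2
    # 'a'      end_array    None
    # ''       end_map      None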

The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
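
If the file's top level is a large JSON array, ijson's items() helper is often more convenient than raw parse() events, since it yields one fully built element at a time. A minimal sketch, assuming a hypothetical big_file.json whose top level is an array (the 'item' prefix addresses top-level array elements) and a hypothetical process() callback:

    import ijson

    # Stream one element at a time from a top-level JSON array,
    # e.g. [{"id": 1, ...}, {"id": 2, ...}], without loading the whole file.
    with open("big_file.json", "rb") as f:        # assumed file name
        for record in ijson.items(f, "item"):     # "item" = each top-level array element
            process(record)                       # hypothetical per-record handler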

answered Sep 21 '22 by Jim Pivarski


So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:

  1. Modularize your code. Do something like:

        for json_file in list_of_files:
            process_file(json_file)

    If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.

  2. Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another Python process that calls your script via subprocess.Popen (a minimal sketch follows this list). This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
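
A minimal sketch of the second approach, assuming a hypothetical parse_one.py script that parses the single file named on its command line:

    import subprocess
    import sys

    for json_file in list_of_files:
        # Parse each file in its own child process; whatever memory the parse
        # uses is released back to the OS when that process exits.
        proc = subprocess.Popen([sys.executable, "parse_one.py", json_file])
        proc.wait()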

Hope this helps.

answered Sep 22 '22 by jcdyer