I have some JSON files of about 500 MB each. If I use the "trivial" json.load() to load their content all at once, it consumes a lot of memory.
Is there a way to read the file partially? If it were a text, line-delimited file, I would be able to iterate over the lines. I am looking for an analogy to that.
If you look at our large JSON file, it contains characters that don't fit in ASCII. Because it's loaded as one giant string, that whole string uses a less efficient memory representation.
There was a duplicate of this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson

# Walk the JSON document as a stream of SAX-like events.
for prefix, the_type, value in ijson.parse(open(json_file_name)):
    print(prefix, the_type, value)
where prefix is a dot-separated index into the JSON tree (what happens if your key names contain dots? I guess that would be bad for JavaScript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object, or None if the_type is an event like the start or end of a map or array.
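If you would rather get whole Python objects back instead of raw events, ijson also provides items(), which yields each value found under a given prefix. Here is a minimal sketch, assuming the file is a top-level JSON array (so each element appears under the prefix "item"):

import ijson

# Stream one element at a time from a top-level JSON array;
# only the current item is materialized in memory.
with open(json_file_name, "rb") as f:
    for item in ijson.items(f, "item"):
        print(item)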
The project has some docstrings, but not enough global documentation; I had to dig into ijson/common.py to find what I was looking for.
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine unless you are keeping around references you don't need. It's hard to tell exactly what's happening without more information, but here are some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
    process_file(json_file)
If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.
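As a rough sketch of that shape (the body of process_file here is just a placeholder; swap in whatever per-file work you actually need, and list_of_files is assumed to be defined as above):

import json

def process_file(json_file):
    # Parse one file, keep only a small result, and let the full
    # parsed structure be garbage-collected when the function returns.
    with open(json_file) as f:
        data = json.load(f)
    return len(data)  # stand-in for your real per-file output

results = [process_file(name) for name in list_of_files]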
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another Python process that calls your script via subprocess.Popen. This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
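A rough sketch of that driver, assuming your single-file parser lives in a hypothetical parse_one.py that takes the file name as its only argument (check_call is just a convenience wrapper around Popen that waits for the child and raises on failure):

import subprocess
import sys

for json_file in list_of_files:
    # Each file is parsed in its own interpreter, so all of its memory
    # is returned to the OS when that child process exits.
    subprocess.check_call([sys.executable, "parse_one.py", json_file])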
Hope this helps.