
Reading rather large JSON files [duplicate]

Tags:

python

json

Possible Duplicate:
Is there a memory efficient and fast way to load big JSON files?

So I have some rather large JSON-encoded files. The smallest is 300MB, and that is by far the smallest; the rest are multiple GB, anywhere from around 2GB to 10GB+.

So I seem to run out of memory when trying to load the file with Python. I'm currently just running some tests to see roughly how long dealing with this stuff is going to take, so I can work out where to go from here. Here is the code I'm using to test:

from datetime import datetime
import json

print datetime.now()

f = open('file.json', 'r')
json.load(f)
f.close()

print datetime.now()

Not too surprisingly, Python gives me a MemoryError. It appears that json.load() calls json.loads(f.read()), which tries to load the entire file into memory first, and that clearly isn't going to work.

Any way I can solve this cleanly?

I know this is old, but I don't think this is a duplicate. While the answer is the same, the question is different. In the "duplicate", the question is how to read large files efficiently, whereas this question deals with files that won't even fit into memory at all. Efficiency isn't required.

Tom Carrick, asked Apr 30 '12 10:04


People also ask

How do I read a large JSON file in Python?

To load big JSON files in a memory-efficient and fast way with Python, we can use the ijson library. We call ijson.parse to parse the file opened by open. Then we print the key prefix, the data type of the JSON value stored in the_type, and the value of the entry with the given key prefix.
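A minimal sketch of that pattern (assuming a file named file.json; ijson.parse yields a stream of (prefix, event, value) tuples as it reads the file, so only a small piece is in memory at a time):

import ijson

# Open in binary mode; ijson reads the file incrementally rather than all at once.
with open('file.json', 'rb') as f:
    for prefix, the_type, value in ijson.parse(f):
        print(prefix, the_type, value)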

How big is too big for a JSON response?

One of the more frequently asked questions about the native JSON data type is what size a JSON document can be. The short answer is that the maximum size is 1GB.


1 Answer

The issue here is that JSON, as a format, is generally parsed in full and then handled in-memory, which for such a large amount of data is clearly problematic.

The solution to this is to work with the data as a stream - reading part of the file, working with it, and then repeating.

The best option appears to be using something like ijson - a module that will work with JSON as a stream, rather than as a block file.
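As a rough sketch of what that looks like (assuming the top-level JSON value is an array and the file is called file.json; process() is a hypothetical per-record handler, not part of ijson):

import ijson

def process(record):
    # Hypothetical handler; do whatever per-record work is needed here.
    print(record)

with open('file.json', 'rb') as f:
    # 'item' addresses each element of a top-level array, so ijson yields
    # the records one at a time instead of loading the whole document.
    for record in ijson.items(f, 'item'):
        process(record)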

Edit: Also worth a look - kashif's comment about json-streamer and Henrik Heino's comment about bigjson.

Gareth Latty, answered Sep 21 '22 08:09