I have this massive json file (8gb), and I run out of memory when trying to read it in to Python. How would I implement a similar procedure using ijson or some other library that is more efficient with large json files?
import pandas as pd
#There are (say) 1m objects - each is its json object - within in this file.
with open('my_file.json') as json_file:
data = json_file.readlines()
#So I take a list of these json objects
list_of_objs = [obj for obj in data]
#But I only want about 200 of the json objects
desired_data = [obj for obj in list_of_objs if object['feature']=="desired_feature"]
How would I implement this using ijson or something similar? Is there a way I can extract the objects I want without reading in the whole JSON file?
The file is a list of objects like:
{
"review_id": "zdSx_SD6obEhz9VrW9uAWA",
"user_id": "Ha3iJu77CxlrFm-vQRs_8g",
"business_id": "tnhfDv5Il8EaGSXZGiuQGg",
"stars": 4,
"date": "2016-03-09",
"text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",
"useful": 0,
"funny": 0,
}
The file is a list of objects
This is a little ambiguous. Looking at your code snippet it looks like your file contains separate JSON object on each line. Which is not the same as the actual JSON array that starts with [, ends with ] and has , between items.
In the case of a json-per-line file it's as easy as:
import json
from itertools import islice
with(open(filename)) as f:
objects = (json.loads(line) for line in f)
objects = islice(objects, 200)
Note the differences:
.readlines(), the file object itself is an iterable that yields individual lines(..) instead of brackets [..] in (... for line in f) create a lazy generator expression instead of a Python list in memory with all the linesislice(objects, 200) will give you the first 200 items without iterating further. If objects would've been a list you could just do objects[:200]Now, if your file is actually a JSON array then you indeed need ijson:
import ijson # or choose a faster backend if needed
from itertools import islice
with open(filename) as f:
objects = ijson.items(f, 'item')
objects = islice(objects, 200)
ijson.items returns a lazy iterator over a parsed array. The 'item' in the second parameter means "each item in a top-level array".
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With