
How to use ijson/other to parse this large JSON file?

I have this massive json file (8gb), and I run out of memory when trying to read it in to Python. How would I implement a similar procedure using ijson or some other library that is more efficient with large json files?

import json

# There are (say) 1m objects - each is its own JSON object - in this file.
with open('my_file.json') as json_file:
    data = json_file.readlines()
    # Parse each line into a dict so the objects can be filtered by key
    list_of_objs = [json.loads(line) for line in data]

# But I only want about 200 of the JSON objects
desired_data = [obj for obj in list_of_objs if obj['feature'] == "desired_feature"]

How would I implement this using ijson or something similar? Is there a way I can extract the objects I want without reading in the whole JSON file?

The file is a list of objects like:

{
    "review_id": "zdSx_SD6obEhz9VrW9uAWA",
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g",
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",
    "stars": 4,
    "date": "2016-03-09",
    "text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",
    "useful": 0,
    "funny": 0
}
user9090553 asked Dec 10 '25 15:12


1 Answer

The file is a list of objects

This is a little ambiguous. Your code snippet suggests the file contains a separate JSON object on each line, which is not the same as an actual JSON array that starts with [, ends with ], and has , between items.

In the case of a JSON-per-line file, it's as easy as:

import json
from itertools import islice

with open(filename) as f:
    objects = (json.loads(line) for line in f)
    objects = islice(objects, 200)

Note the differences:

  • you don't need .readlines(); the file object itself is an iterable that yields individual lines
  • parentheses (..) instead of brackets [..] in (... for line in f) create a lazy generator expression instead of an in-memory Python list of all the lines
  • islice(objects, 200) gives you the first 200 items without iterating further. If objects were a list, you could just do objects[:200]
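Combining this with the filter from the question, here is a minimal self-contained sketch of the JSON-per-line approach. The field name "feature" and the value "desired_feature" are taken from the question's snippet; the three-object sample file exists only to make the example runnable:

```python
import json
from itertools import islice

# Write a tiny sample JSON-lines file so the sketch is self-contained
# (field names follow the question; adjust to your real data).
sample = [
    {"feature": "desired_feature", "stars": 4},
    {"feature": "other", "stars": 2},
    {"feature": "desired_feature", "stars": 5},
]
with open("my_file.json", "w") as f:
    for obj in sample:
        f.write(json.dumps(obj) + "\n")

with open("my_file.json") as f:
    # Lazily parse one line at a time; nothing is held in memory but the current object
    objects = (json.loads(line) for line in f)
    # Filter lazily, then stop after at most 200 matches
    matches = (obj for obj in objects if obj["feature"] == "desired_feature")
    desired_data = list(islice(matches, 200))

print(len(desired_data))  # 2 matches in the sample
```

Because every step is a generator, only the final list of matches is materialized; the 8 GB file is never loaded whole.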

Now, if your file is actually a JSON array then you indeed need ijson:

import ijson  # or choose a faster backend if needed
from itertools import islice

with open(filename) as f:
    objects = ijson.items(f, 'item')
    objects = islice(objects, 200)

ijson.items returns a lazy iterator over the items of a parsed array. The second argument, 'item', means "each item in a top-level array".

isagalaev answered Dec 13 '25 05:12


