Fast Parsing a huge 12 GB JSON file with Python

Question

I have a 12 GB JSON file in which every line contains information about a scientific paper.

I want to parse it and create 3 pandas dataframes that contain information about venues, authors, and how many times an author has published in a venue. Below you can see the code I have written. My problem is that this code needs many days to run. Is there a way to make it faster?

import json

import ijson
import pandas as pd

venues = pd.DataFrame(columns = ['id', 'raw', 'type'])
authors = pd.DataFrame(columns = ['id','name'])
main = pd.DataFrame(columns = ['author_id','venue_raw','number_of_times'])
with open(r'C:\Users\dintz\Documents\test.json',encoding='UTF-8') as infile:
    papers = ijson.items(infile, 'item')
    for paper in papers:
        if 'id' not in paper["venue"]:
            if 'type' not in paper["venue"]:
                venues = venues.append({'raw': paper["venue"]["raw"]},ignore_index=True)
            else:
                venues = venues.append({'raw': paper["venue"]["raw"], 'type': paper["venue"]["type"]},ignore_index=True)
        else:
            venues = venues.append({'id': paper["venue"]["id"] , 'raw': paper["venue"]["raw"], 'type': paper["venue"]["type"]},ignore_index=True)
        paper_authors = paper["authors"]
        paper_authors_json = json.dumps(paper_authors)
        obj = ijson.items(paper_authors_json,'item')
        for author in obj:
            authors = authors.append({'id': author["id"] , 'name': author["name"]},ignore_index=True)
            main = main.append({'author_id': author["id"] , 'venue_raw': venues.iloc[-1]['raw'],'number_of_times': 1},ignore_index=True)

authors = authors.drop_duplicates(subset=None, keep='first', inplace=False)
venues = venues.drop_duplicates(subset=None, keep='first', inplace=False)
main = main.groupby(by=['author_id','venue_raw'], axis=0, as_index = False).sum()

Tagar · Accepted Answer

Apache Spark can read JSON files in multiple chunks in parallel, which makes the load faster: https://spark.apache.org/docs/latest/sql-data-sources-json.html

For a regular multi-line JSON file, set the multiLine parameter to True.

If you're not familiar with Spark, you can use Koalas, a pandas-compatible layer on top of Spark:

https://koalas.readthedocs.io/en/latest/

Koalas read_json call - https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_json.html

David Chen · Answer

You are using the wrong tool for this task; do not use pandas in this scenario. Look at the last 3 lines of your code: they are simple and clean, but filling the data into pandas dataframes is not so easy when you cannot use a pandas input function such as read_json() or read_csv().

I prefer pure Python for this simple task. If your PC has sufficient memory, use a dict to collect the unique authors and venues, use itertools.groupby for the grouping, and use more_itertools.ilen to compute the counts.
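The grouping and counting step can be sketched roughly as follows. This is only an illustration: the sample pairs and the helper name count_pairs are made up, and sum(1 for _ in group) is used as a standard-library stand-in for more_itertools.ilen. Note that itertools.groupby only merges adjacent equal keys, so the pairs have to be sorted first, which means they must all fit in memory.

```python
from itertools import groupby

# Hypothetical (author_id, venue) pairs: one entry per author per paper.
pairs = [
    ("a1", "VLDB"),
    ("a2", "VLDB"),
    ("a1", "ICML"),
    ("a1", "VLDB"),
]


def count_pairs(pairs):
    """Return {(author_id, venue): number_of_times} via sort + groupby."""
    # groupby only groups adjacent equal items, hence the sort.
    # sum(1 for _ in group) counts the group like more_itertools.ilen would.
    return {key: sum(1 for _ in group) for key, group in groupby(sorted(pairs))}


counts = count_pairs(pairs)
```

Here counts maps each (author, venue) pair to the number of papers, which corresponds to the number_of_times column in the question.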

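authors = {}
venues = {}
for paper in papers:  # papers: the ijson.items(...) iterator from the question
    venue = paper["venue"]
    # "id" and "type" can be missing; fall back to the raw name as the key
    venues[venue.get("id", venue["raw"])] = (venue["raw"], venue.get("type"))
    for author in paper["authors"]:
        authors[author["id"]] = author["name"]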
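Putting the pieces together, a single streaming pass might look like the sketch below. It rests on a few assumptions that are not confirmed by the question: that the file really is one JSON object per line (if it is one big top-level array, the ijson.items loop from the question is still needed), that venues can be keyed by their raw name because id is sometimes missing, and that collections.Counter is an acceptable substitute for the groupby/ilen combination. The function name summarize and the sample records are hypothetical.

```python
import json
from collections import Counter


def summarize(lines):
    """Consume an iterable of JSON lines once; return venues, authors, counts."""
    venues = {}         # venue raw name -> (id, type); id/type may be None
    authors = {}        # author id -> author name
    counts = Counter()  # (author id, venue raw name) -> number of papers
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        paper = json.loads(line)
        venue = paper.get("venue") or {}
        raw = venue.get("raw")
        venues[raw] = (venue.get("id"), venue.get("type"))
        for author in paper.get("authors", []):
            authors[author["id"]] = author.get("name")
            counts[(author["id"], raw)] += 1
    return venues, authors, counts


# Made-up records in the shape the question's code implies.
sample = [
    '{"venue": {"raw": "VLDB", "id": "v1", "type": "C"}, "authors": [{"id": "a1", "name": "Ann"}]}',
    '{"venue": {"raw": "VLDB", "id": "v1", "type": "C"}, "authors": [{"id": "a1", "name": "Ann"}, {"id": "a2", "name": "Bob"}]}',
    '{"venue": {"raw": "arXiv"}, "authors": [{"id": "a2", "name": "Bob"}]}',
]
venues, authors, counts = summarize(sample)
```

For the real file, pass the open file object instead of a list, since iterating over a file yields one line at a time. If dataframes are still needed, they can be built once at the end from these three containers rather than appended to row by row.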

Tags: python, json, python-3.x

Δημήτριος Ιντζελερ
