Fast Parsing a huge 12 GB JSON file with Python

I have a 12 GB JSON file in which every line contains information about a scientific paper. This is how it looks:

(screenshot of a sample line omitted: one JSON object per line, with nested "venue" and "authors" fields)

I want to parse it and create 3 pandas DataFrames that contain information about venues, authors, and how many times an author has published in a venue. Below you can see the code I have written. My problem is that this code needs many days to run. Is there a way to make it faster?

import ijson
import pandas as pd

venues = pd.DataFrame(columns=['id', 'raw', 'type'])
authors = pd.DataFrame(columns=['id', 'name'])
main = pd.DataFrame(columns=['author_id', 'venue_raw', 'number_of_times'])
with open(r'C:\Users\dintz\Documents\test.json', encoding='UTF-8') as infile:
    papers = ijson.items(infile, 'item')
    for paper in papers:
        venue = paper["venue"]
        if 'id' not in venue:
            if 'type' not in venue:
                venues = venues.append({'raw': venue["raw"]}, ignore_index=True)
            else:
                venues = venues.append({'raw': venue["raw"], 'type': venue["type"]}, ignore_index=True)
        else:
            venues = venues.append({'id': venue["id"], 'raw': venue["raw"], 'type': venue["type"]}, ignore_index=True)
        # iterate the parsed author list directly; no need to re-serialize
        # it with json.dumps and parse it again with ijson
        for author in paper["authors"]:
            authors = authors.append({'id': author["id"], 'name': author["name"]}, ignore_index=True)
            main = main.append({'author_id': author["id"], 'venue_raw': venues.iloc[-1]['raw'], 'number_of_times': 1}, ignore_index=True)

authors = authors.drop_duplicates(subset=None, keep='first', inplace=False)
venues = venues.drop_duplicates(subset=None, keep='first', inplace=False)
main = main.groupby(by=['author_id', 'venue_raw'], axis=0, as_index=False).sum()
asked Nov 28 '25 by Δημήτριος Ιντζελερ

2 Answers

Apache Spark can read JSON files in multiple chunks in parallel, which makes it much faster - https://spark.apache.org/docs/latest/sql-data-sources-json.html

By default Spark expects one JSON object per line (JSON Lines), which matches your file; only for a regular multi-line JSON file would you set the multiLine parameter to True.

If you're not familiar with Spark, you can use a Pandas-compatible layer on top of Spark called Koalas -

https://koalas.readthedocs.io/en/latest/

Koalas read_json call - https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_json.html
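Koalas' read_json mirrors the pandas API. If installing Spark is not an option, plain pandas can also stream a line-delimited file in chunks rather than loading all 12 GB at once; a minimal sketch (the file written here is a tiny stand-in for your real file, and the chunk size is just an illustrative value):

```python
import pandas as pd

# Stand-in for the real 12 GB file: one JSON object per line.
with open("test.json", "w", encoding="utf-8") as f:
    f.write('{"venue": {"id": 1, "raw": "ICML", "type": 0}, "authors": [{"id": 10, "name": "A"}]}\n')
    f.write('{"venue": {"id": 1, "raw": "ICML", "type": 0}, "authors": [{"id": 11, "name": "B"}]}\n')

# lines=True treats each line as one JSON object; chunksize makes
# read_json return an iterator of DataFrames instead of one huge frame.
venue_frames = [
    pd.json_normalize(chunk["venue"].tolist())  # flatten nested venue dicts
    for chunk in pd.read_json("test.json", lines=True, chunksize=100_000)
]
venues = pd.concat(venue_frames, ignore_index=True).drop_duplicates()
print(len(venues))  # the sample has one unique venue
```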

answered Nov 30 '25 by Tagar

You're using the wrong tool for this task; do not use pandas for this scenario. Your last 3 lines of code are simple and clean, but filling that data into a pandas DataFrame row by row is not, and you cannot use the fast pandas input functions such as read_json() or read_csv() directly on this nested structure.

I prefer pure Python for this simple task. If your PC has sufficient memory, use dicts to collect the unique authors and venues, and count the (author, venue) pairs as you go (sorting plus itertools.groupby with more_itertools.ilen to compute the counts works too).

authors = {}
venues = {}
counts = {}  # (author_id, venue_id) -> number of papers
for paper in papers:
    venue = paper["venue"]
    # .get() tolerates papers whose venue lacks "id" or "type"
    venues[venue.get("id")] = (venue.get("raw"), venue.get("type"))
    for author in paper["authors"]:
        authors[author["id"]] = author["name"]
        key = (author["id"], venue.get("id"))
        counts[key] = counts.get(key, 0) + 1
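Putting that together into a self-contained sketch that streams the file line by line (assuming one JSON object per line, as in your screenshot) and builds the three DataFrames once at the end. The sample file written here is a stand-in for the real 12 GB file; constructing each DataFrame from the completed dicts avoids the per-row append cost, which copies the whole frame on every call:

```python
import json
from collections import Counter
import pandas as pd

# Stand-in for the real 12 GB file: one JSON object per line.
with open("test.json", "w", encoding="utf-8") as f:
    f.write('{"venue": {"id": 1, "raw": "ICML", "type": 0}, "authors": [{"id": 10, "name": "A"}]}\n')
    f.write('{"venue": {"id": 1, "raw": "ICML", "type": 0}, "authors": [{"id": 10, "name": "A"}, {"id": 11, "name": "B"}]}\n')

authors, venues, counts = {}, {}, Counter()
with open("test.json", encoding="utf-8") as infile:
    for line in infile:              # streams: one paper in memory at a time
        paper = json.loads(line)
        venue = paper["venue"]
        venues[venue.get("id")] = (venue.get("raw"), venue.get("type"))
        for author in paper["authors"]:
            authors[author["id"]] = author["name"]
            counts[(author["id"], venue.get("id"))] += 1

# Build the DataFrames once, at the end.
venues_df = pd.DataFrame(
    [(vid, raw, vtype) for vid, (raw, vtype) in venues.items()],
    columns=["id", "raw", "type"])
authors_df = pd.DataFrame(list(authors.items()), columns=["id", "name"])
main_df = pd.DataFrame(
    [(a, v, n) for (a, v), n in counts.items()],
    columns=["author_id", "venue_id", "number_of_times"])
```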
answered Nov 30 '25 by David Chen


