
How to read a large json in pandas?

My code is: data_review = pd.read_json('review.json'). The review data looks as follows:

{
    // string, 22 character unique review id
    "review_id": "zdSx_SD6obEhz9VrW9uAWA",

    // string, 22 character unique user id, maps to the user in user.json
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g",

    // string, 22 character business id, maps to business in business.json
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

    // integer, star rating
    "stars": 4,

    // string, date formatted YYYY-MM-DD
    "date": "2016-03-09",

    // string, the review itself
    "text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",

    // integer, number of useful votes received
    "useful": 0,

    // integer, number of funny votes received
    "funny": 0,

    // integer, number of cool votes received
    "cool": 0
}

But I got the following error:

    333             fh, handles = _get_handle(filepath_or_buffer, 'r',
    334                                       encoding=encoding)
--> 335             json = fh.read()
    336             fh.close()
    337         else:

OSError: [Errno 22] Invalid argument

My JSON file does not contain any comments, and it is 3.8 GB! I just downloaded the file from here to practice: link

When I use the following code, it throws the same error:

import json
with open('review.json') as json_file:
    data = json.load(json_file)
asked Oct 17 '17 12:10 by ileadall42



1 Answer

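Passing lines=True together with chunksize=X to pd.read_json makes it return a reader that yields a fixed number of lines at a time instead of loading the whole file into memory. Note that lines=True expects the file to hold one JSON object per line.

You then loop over the reader to process each chunk.

Here is a piece of code to illustrate: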

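import pandas as pd

# lines=True: one JSON object per line; chunksize: return an iterator of DataFrames
chunks = pd.read_json('../input/data.json', lines=True, chunksize=10000)
for chunk in chunks:
    print(chunk)  # each chunk is a DataFrame with up to 10000 rows
    break         # stop after the first chunk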

The reader splits the file into chunks according to its length in lines. For example, if I have a 100,000-line JSON file (one object per line) and set chunksize=10000, I get 10 chunks.
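The chunk arithmetic can be checked on a small scale. The following is a minimal sketch, not taken from the original answer: the in-memory sample, its field names, and the variable names are all made up for illustration, and io.StringIO stands in for a real file.

```python
import io
import math
import pandas as pd

# Hypothetical stand-in for a JSON Lines file: 25 lines, one object per line.
n_lines = 25
sample = "\n".join('{"stars": %d, "useful": 0}' % (i % 5 + 1) for i in range(n_lines))

chunk_size = 10
reader = pd.read_json(io.StringIO(sample), lines=True, chunksize=chunk_size)

# Each chunk is a DataFrame; record how many rows each one holds.
chunk_lengths = [len(chunk) for chunk in reader]

# The number of chunks is the line count divided by chunksize, rounded up.
expected_chunks = math.ceil(n_lines / chunk_size)
print(chunk_lengths, expected_chunks)
```

When the line count is not an exact multiple of chunksize, the last chunk is simply shorter than the others.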

In the code above I added a break so that only the first chunk is printed; if you remove it, all 10 chunks are printed one by one.
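Printing every chunk of a multi-gigabyte file is rarely what you want, so a common pattern is to keep only the rows you need from each chunk and combine them at the end. The sketch below assumes a "stars" column like the one in the question; the sample data, the threshold of 4, and the variable names are illustrative assumptions, and for the real file you would pass its path instead of the io.StringIO object.

```python
import io
import pandas as pd

# Hypothetical sample standing in for review.json (one JSON object per line).
sample = "\n".join(
    '{"review_id": "r%d", "stars": %d}' % (i, i % 5 + 1) for i in range(20)
)

kept = []
reader = pd.read_json(io.StringIO(sample), lines=True, chunksize=8)
for chunk in reader:
    # Keep only the rows of interest so memory use stays bounded by the chunk size
    # plus whatever survives the filter.
    kept.append(chunk[chunk["stars"] >= 4])

# Combine the filtered pieces into a single DataFrame.
high_rated = pd.concat(kept, ignore_index=True)
print(len(high_rated))
```

Filtering inside the loop matters: appending whole chunks and concatenating them would rebuild the full dataset in memory and defeat the purpose of chunking.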

answered Oct 24 '22 09:10 by Max
