Json file to dictionary

I am using the Yelp dataset and I want to parse the review JSON file into a dictionary. I tried loading it into a pandas DataFrame and then creating the dictionary, but the file is so big that this is very slow. I want to keep only the user_id and stars values. A line of the JSON file looks like this:

{
    "votes": {
        "funny": 0, "useful": 2, "cool": 1},
    "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
    "review_id": "15SdjuK7DmYqUAj6rjGowg", "stars": 5, "date": "2007-05-17",
    "text": "dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
    "type": "review", "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}

How can I iterate over every 'field' (for lack of a better word)? So far I can only iterate over each line.

EDIT

As requested, here is the pandas code.

Reading the JSON

import json
import pandas as pd
with open('yelp_academic_dataset_review.json') as f:
    df = pd.DataFrame(json.loads(line) for line in f)

Creating the dictionary

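# Map (business_id, user_id) -> star rating.
# Named `ratings` so the built-in `dict` is not shadowed.
ratings = {}

for i, row in df.iterrows():
    business_id = row['business_id']
    user_id = row['user_id']
    rating = row['stars']
    key = (business_id, user_id)
    ratings[key] = rating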

mnmbs asked Oct 21 '25 03:10


1 Answer

You don't need to read this into a DataFrame. json.load() returns a dictionary when the file contains a single JSON object. For example:

sample.json

{
"votes": {
"funny": 0,
"useful": 2,
"cool": 1
},
"user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
"review_id": "15SdjuK7DmYqUAj6rjGowg",
"stars": 5,
"date": "2007-05-17",
"text": "dr. goldberg offers everything i look for in a general practitioner. he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first. really, what more do you need? i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
"type": "review",
"business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}

read_json.py

import json

with open('sample.json', 'r') as fh:
    result_dict = json.load(fh)

print(result_dict['user_id'])
print(result_dict['stars'])

output

Xqd0DzHaiyRqVH3WRG7hzg
5
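
Since the question asks how to iterate over every field, here is a minimal sketch using dict.items() on a parsed record. The inline `line` string is a shortened stand-in for one line of the file, and `fields` is just an illustrative name; neither comes from the original post.

```python
import json

# Shortened stand-in for one line of the review file (illustrative only).
line = (
    '{"votes": {"funny": 0, "useful": 2, "cool": 1}, '
    '"user_id": "Xqd0DzHaiyRqVH3WRG7hzg", "stars": 5, "type": "review"}'
)

record = json.loads(line)

# Each top-level key/value pair is one "field" of the record.
fields = list(record.items())
for name, value in fields:
    print(name, value)

# Nested objects such as "votes" are ordinary dictionaries too.
for vote_type, count in record["votes"].items():
    print(vote_type, count)
```

The same loop works on the dictionary returned by json.load() above.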

With that output you can easily create a DataFrame.

There are several good discussions on SO about parsing JSON as a stream. The gist is that the standard library does not support it natively, although some third-party tools attempt it.

In the interest of keeping your code simple and with minimal dependencies, you might see if reading the JSON directly into a dictionary is a sufficient improvement.
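
As a rough sketch of that approach applied to the question's file, assuming it holds one JSON object per line (as the question's own loop implies), the dictionary can be built while reading, without a DataFrame in between. The function name `load_ratings` and the two sample records are made up for illustration; the demo writes a tiny temporary file instead of relying on the real dataset.

```python
import json
import os
import tempfile


def load_ratings(path):
    """Map (business_id, user_id) -> stars, reading one JSON object per line."""
    ratings = {}
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            record = json.loads(line)
            key = (record["business_id"], record["user_id"])
            ratings[key] = record["stars"]
    return ratings


# Demo with two made-up records written to a temporary file.
sample_lines = [
    '{"user_id": "u1", "business_id": "b1", "stars": 5, "text": "great"}',
    '{"user_id": "u2", "business_id": "b1", "stars": 3, "text": "ok"}',
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("\n".join(sample_lines) + "\n")
    tmp_path = tmp.name

ratings = load_ratings(tmp_path)
os.remove(tmp_path)
print(ratings)
```

Only one record is parsed at a time and only the three needed values are kept, so the review text never accumulates in memory. For the real file the call would presumably be load_ratings('yelp_academic_dataset_review.json').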

binarysubstrate answered Oct 23 '25 18:10