I have a JSON file which has multiple objects such as:
{"reviewerID": "bc19970fff3383b2fe947cf9a3a5d7b13b6e57ef2cd53abc52bb2dfedf5fb1cd", "asin": "a6ed402934e3c1138111dce09256538afb04c566edf37c16b9ba099d23afb764", "overall": 2.0, "helpful": {"nHelpful": 1, "outOf": 1}, "reviewText": "This remote, for whatever reason, was chosen by Time Warner to replace their previous silver remote, the Time Warner Synergy V RC-U62CP-1.12S. The actual function of this CLIKR-5 is OK, but the ergonomic design sets back remotes by 20 years. The buttons are all the same, there's no separation of the number buttons, the volume and channel buttons are the same shape as the other buttons on the remote, and it all adds up to a crappy user experience. Why would TWC accept this as a replacement? I'm skipping this and paying double for a refurbished Synergy V.", "summary": "Ergonomic nightmare", "unixReviewTime": 1397433600}
{"reviewerID": "3689286c8658f54a2ff7aa68ce589c81f6cae4c4d9de76fa0f66d5c114f79837", "asin": "8939d791e9dd035aa58da024ace69b20d651cea4adf6159d984872b44f663301", "overall": 4.0, "helpful": {"nHelpful": 21, "outOf": 22}, "reviewText": "This is a great truck GPS. I've tried others and nothing seems to come close to the Rand McNally TND-700.Excellent screen size and resolution. The audio is loud enough to be heard over road noise and the purr of my Kenworth/Cat engine. I've used it for the last 8,000 miles or so and it has only glitched once. Just restarted it and it picked up on my route right where it should have.Clean up the minor issues and this unit rates a solid 5.Rand McNally 528881469 7-inch Intelliroute TND 700 Truck GPS", "summary": "Great Unit!", "unixReviewTime": 1280016000}
I am trying to convert it to a Pandas DataFrame using the following code:
train_df = pd.DataFrame()
count = 0;
for l in open('train.json'):
try:
count +=1
if(count==20001):
break
obj1 = json.loads(l)
df1=pd.DataFrame(obj1, index=[0])
train_df = train_df.append(df1, ignore_index=True)
except ValueError:
line = line.replace('\\','')
obj = json.loads(line)
df1=pd.DataFrame(obj, index=[0])
train_df = train_df.append(df1, ignore_index=True)
However, it gives me 'NaN' for nested values i.e. 'helpful' attribute. I want the output such that both the keys of the nested attribute are a separate column.
EDIT:
P.S: I am using try/except because I have '\' character in some objects which gives me a JSON decode error.
Can anyone help? Is there any other approach I can use?
Thank You.
Read JSON File into DataFrame You can convert JSON to Pandas DataFrame by simply using read_json() . Just pass JSON string to the function. It takes multiple parameters, for our case I am using orient that specifies the format of JSON string. This function is also used to read JSON files into pandas DataFrame.
To read a JSON file via Pandas, we'll utilize the read_json() method and pass it the path to the file we'd like to read. The method returns a Pandas DataFrame that stores data in the form of columns and rows.
Pandas Load JSON into the DataFrameStep 1: You need to create a JSON file that contains JSON strings. Step 2: Save the file with extension . json to create a JSON file. Step 3: Load the JSON file in Pandas using the command below.
Parse JSON - Convert from JSON to Python If you have a JSON string, you can parse it by using the json.loads() method. The result will be a Python dictionary.
Use json_normalize
on the list of dictionaries which performs reasonably faster on large number of json objects.
from pandas.io.json import json_normalize
my_list = []
with open('train.json') as f:
for line in f:
line = line.replace('\\','')
my_list.append(json.loads(line))
# avoid transposing if you want to keep keys as columns of the dataframe
result_df = json_normalize(my_list).T
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With