Let's say I have the following DataFrame, where the data
column contains a nested JSON string that I want to parse into separate columns:
import pandas as pd

df = pd.DataFrame({
    'bank_account': [101, 102, 201, 301],
    'data': [
        '{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"}',
        '{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0}, "fees": {"monthly": 0}}, "user_name": "Alice"}',
        '{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Bob"}',
        '{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Carol"}'
    ]},
    index=['Alice', 'Alice', 'Bob', 'Carol']
)
df
I've found the json_normalize function and am currently parsing the JSON in a list comprehension. The result is correct, but this is slow: 1,000 rows take 1-2 seconds, and my real data has about a million rows:
import json
from pandas.io.json import json_normalize

# parse each JSON string separately, then concatenate the per-row frames
parsed_df = pd.concat([json_normalize(json.loads(js)) for js in df['data']])
parsed_df['bank_account'] = df['bank_account'].values
parsed_df.index = parsed_df['uid']
parsed_df
Is there a faster way to parse this data into a nice-looking DataFrame?
I see a small (~25%) performance improvement from bypassing pandas.concat. Otherwise, rewriting or optimizing json_normalize doesn't seem straightforward.
def original(df):
    # baseline: one json_normalize call per row, then concatenate the results
    parsed_df = pd.concat([json_normalize(json.loads(js)) for js in df['data']])
    parsed_df['bank_account'] = df['bank_account'].values
    parsed_df.index = parsed_df['uid']
    return parsed_df

def jp(df):
    # skip pd.concat: build one DataFrame from the per-row value arrays;
    # cols must match the column order that json_normalize emits
    cols = ['account_data.currency.current', 'account_data.currency.minimum',
            'account_data.fees.monthly', 'account_type', 'uid', 'user_name']
    parsed_df = pd.DataFrame([json_normalize(json.loads(js)).values[0] for js in df['data']],
                             columns=cols)
    parsed_df['bank_account'] = df['bank_account'].values
    parsed_df.index = parsed_df['uid']
    return parsed_df
# scale the 4-row sample up to 400 rows for benchmarking
df = pd.concat([df]*100, ignore_index=True)

%timeit original(df)  # 675 ms per loop
%timeit jp(df)        # 526 ms per loop
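If the JSON structure is fixed and known in advance, another option worth trying is to skip json_normalize entirely and flatten each record by hand, building the DataFrame in a single pass. The sketch below is illustrative and not part of the benchmark above; manual_flatten is a name I've chosen here, and the dotted column names simply mirror json_normalize's default separator:

import json
import pandas as pd

def manual_flatten(df):
    # Sketch, assuming every record has the fixed structure shown in the question:
    # parse each JSON string once, flatten the known nesting by hand, and build
    # the DataFrame in a single pass (no per-row json_normalize or pd.concat).
    records = []
    for js, acct in zip(df['data'], df['bank_account']):
        d = json.loads(js)
        currency = d['account_data']['currency']
        records.append({
            'uid': d['uid'],
            'account_type': d['account_type'],
            'user_name': d['user_name'],
            'account_data.currency.current': currency['current'],
            'account_data.currency.minimum': currency['minimum'],
            'account_data.fees.monthly': d['account_data']['fees']['monthly'],
            'bank_account': acct,
        })
    parsed_df = pd.DataFrame(records)
    parsed_df.index = parsed_df['uid']
    return parsed_df

Since json.loads then dominates the remaining work, a drop-in faster parser such as ujson or orjson may help further, though the actual gain depends on your data. Also note that on pandas 1.0 and later, json_normalize is available directly as pd.json_normalize, so the pandas.io.json import is no longer needed.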