I have a JSON file as follows:
{
"ALPHA": [
{
"date": "2021-06-22",
"constituents": {
"BBB": 0,
"EEE": 1,
"BTB": 1,
"YUY": 1
}
},
{
"date": "2021-09-07",
"constituents": {
"BBB": 0,
"EEE": 0,
"BTB": 0,
"YUY": 0
}
}
],
"BETA": [
{
"date": "2021-06-22",
"constituents": {
"BBB": 1,
"EEE": 1,
"BTB": 1,
"YUY": 1
}
},
{
"date": "2021-09-07",
"constituents": {
"BBB": 1,
"EEE": 1,
"BTB": 1,
"YUY": 1
}
}
],
"THETA": [
{
"date": "2021-06-22",
"constituents": {
"BBB": 0,
"EEE": 1,
"BTB": 1,
"YUY": 0
}
},
{
"date": "2021-08-20",
"constituents": {
"BBB": 0,
"EEE": 1,
"BTB": 1,
"YUY": 0
}
},
{
"date": "2021-09-07",
"constituents": {
"BBB": 0,
"EEE": 1,
"BTB": 1,
"YUY": 0
}
}
]
}
I want to read the above into a pandas DataFrame where the first index level is the date, the second index level is the top-level keys (i.e. "ALPHA", "BETA", "THETA"), the columns are the inner keys (i.e. "BBB", "EEE", "BTB", "YUY"), and the cell values are the values of those inner keys.
How can I read that into pandas from the JSON file?
To read a JSON file with pandas, pass the path of the file to read_json(). It returns a DataFrame (a table of rows and columns) holding the data. read_json() can also parse JSON text directly, although newer pandas releases deprecate passing a raw string and expect it to be wrapped in io.StringIO. Its orient parameter specifies the layout of the JSON. By default, numerical columns are cast to numeric types such as int64, which you can check with df.info().
pandas json_normalize() can do most of the work when dealing with nested data from a JSON file. However, it flattens the entire nested structure, which is more than you need when the goal is to extract a single value from deeply nested JSON. For that case, one suggested approach is to use read_json together with glom.
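As a rough illustration of how json_normalize() could be applied to data shaped like the question's, here is a minimal sketch. It assumes the JSON is already loaded into a dict named j (a shortened sample is inlined here), and the loop, the "key" column name and the column renaming are illustrative choices rather than anything json_normalize() does on its own.

```python
import pandas as pd

# Shortened sample in the same shape as the question's data.
j = {
    "ALPHA": [
        {"date": "2021-06-22", "constituents": {"BBB": 0, "EEE": 1, "BTB": 1, "YUY": 1}},
        {"date": "2021-09-07", "constituents": {"BBB": 0, "EEE": 0, "BTB": 0, "YUY": 0}},
    ],
    "BETA": [
        {"date": "2021-06-22", "constituents": {"BBB": 1, "EEE": 1, "BTB": 1, "YUY": 1}},
    ],
}

frames = []
for key, records in j.items():
    # Flattens each record; nested keys become "constituents.BBB", etc.
    flat = pd.json_normalize(records)
    flat["key"] = key  # "key" is an illustrative column name for ALPHA/BETA
    frames.append(flat)

df = pd.concat(frames, ignore_index=True)
# Drop the "constituents." prefix so the columns are just BBB, EEE, ...
df.columns = df.columns.str.replace("constituents.", "", regex=False)
df = df.set_index(["date", "key"])
print(df)
```

Because json_normalize() flattens one list of records at a time, the top-level keys have to be attached by hand, which is why the loop adds the extra column before concatenating.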
You can use pd.Series to import the JSON into a pandas Series with ALPHA, BETA, THETA as the index and the lists of records as the elements. Then expand each list into individual records with .explode(), and expand those records into a DataFrame with .apply() + pd.Series.
Append date as an index level with .set_index() and append=True; swap date from the second index level to the first with .swaplevel().
Finally, take the column constituents and expand the inner JSON into a DataFrame with .apply() + pd.Series, as follows (assuming you have already loaded the JSON file into j):
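import pandas as pd

df = (pd.Series(j)                        # index: ALPHA, BETA, THETA; values: lists
        .explode()                        # one record (dict) per row
        .apply(pd.Series)                 # columns: date, constituents
        .set_index('date', append=True)   # index levels: (key, date)
        .swaplevel()['constituents']      # index levels: (date, key)
        .apply(pd.Series)                 # columns: BBB, EEE, BTB, YUY
     )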
Data input:
j = {'ALPHA': [{'date': '2021-06-22',
'constituents': {'BBB': 0, 'EEE': 1, 'BTB': 1, 'YUY': 1}},
{'date': '2021-09-07',
'constituents': {'BBB': 0, 'EEE': 0, 'BTB': 0, 'YUY': 0}}],
'BETA': [{'date': '2021-06-22',
'constituents': {'BBB': 1, 'EEE': 1, 'BTB': 1, 'YUY': 1}},
{'date': '2021-09-07',
'constituents': {'BBB': 1, 'EEE': 1, 'BTB': 1, 'YUY': 1}}],
'THETA': [{'date': '2021-06-22',
'constituents': {'BBB': 0, 'EEE': 1, 'BTB': 1, 'YUY': 0}},
{'date': '2021-08-20',
'constituents': {'BBB': 0, 'EEE': 1, 'BTB': 1, 'YUY': 0}},
{'date': '2021-09-07',
'constituents': {'BBB': 0, 'EEE': 1, 'BTB': 1, 'YUY': 0}}]}
Output:
print(df)
                  BBB  EEE  BTB  YUY
date
2021-06-22 ALPHA    0    1    1    1
2021-09-07 ALPHA    0    0    0    0
2021-06-22 BETA     1    1    1    1
2021-09-07 BETA     1    1    1    1
2021-06-22 THETA    0    1    1    0
2021-08-20 THETA    0    1    1    0
2021-09-07 THETA    0    1    1    0
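The chain above assumes the JSON has already been loaded into j. A minimal sketch of that loading step with the standard json module follows. The file name constituents.json is hypothetical, and a shortened sample is written to a temporary directory first purely so the sketch runs end to end; with a real file you would only need the json.load part.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Shortened sample in the question's shape, written to a temporary file.
# Both the sample and the file name are illustrative assumptions.
sample = {
    "ALPHA": [{"date": "2021-06-22", "constituents": {"BBB": 0, "EEE": 1}}],
    "THETA": [
        {"date": "2021-06-22", "constituents": {"BBB": 0, "EEE": 1}},
        {"date": "2021-08-20", "constituents": {"BBB": 1, "EEE": 0}},
    ],
}
path = Path(tempfile.mkdtemp()) / "constituents.json"
path.write_text(json.dumps(sample), encoding="utf-8")

# Load the file into a plain dict; this is the `j` the chain expects.
with path.open(encoding="utf-8") as fh:
    j = json.load(fh)

df = (
    pd.Series(j)
    .explode()
    .apply(pd.Series)
    .set_index("date", append=True)
    .swaplevel()["constituents"]
    .apply(pd.Series)
)
print(df)
```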
I feel you get better performance, and potentially easier manipulation, if you deal with native Python data structures outside pandas before pulling the final form into pandas.
Let's flatten the nested dictionary into a list of flat dictionaries, one per record, using plain Python:
container = []
for key, value in j.items():  # j is the main dictionary
    for entry in value:
        content = {'date': entry['date'],
                   'key': key,
                   # expand the nested constituent data
                   # this gets us a single dictionary
                   **entry['constituents']}
        container.append(content)
print(container)
[{'date': '2021-06-22',
'key': 'ALPHA',
'BBB': 0,
'EEE': 1,
'BTB': 1,
'YUY': 1},
{'date': '2021-09-07',
'key': 'ALPHA',
'BBB': 0,
'EEE': 0,
'BTB': 0,
'YUY': 0},
{'date': '2021-06-22', 'key': 'BETA', 'BBB': 1, 'EEE': 1, 'BTB': 1, 'YUY': 1},
{'date': '2021-09-07', 'key': 'BETA', 'BBB': 1, 'EEE': 1, 'BTB': 1, 'YUY': 1},
{'date': '2021-06-22',
'key': 'THETA',
'BBB': 0,
'EEE': 1,
'BTB': 1,
'YUY': 0},
{'date': '2021-08-20',
'key': 'THETA',
'BBB': 0,
'EEE': 1,
'BTB': 1,
'YUY': 0},
{'date': '2021-09-07',
'key': 'THETA',
'BBB': 0,
'EEE': 1,
'BTB': 1,
'YUY': 0}]
Now, build the dataframe, and set the required columns as index:
pd.DataFrame(container).set_index(['date', 'key'])
                  BBB  EEE  BTB  YUY
date       key
2021-06-22 ALPHA    0    1    1    1
2021-09-07 ALPHA    0    0    0    0
2021-06-22 BETA     1    1    1    1
2021-09-07 BETA     1    1    1    1
2021-06-22 THETA    0    1    1    0
2021-08-20 THETA    0    1    1    0
2021-09-07 THETA    0    1    1    0
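With date and key as the two index levels, rows can be looked up by either level. The sketch below is only an illustration of that on a three-row subset of the flattened records; the variable names one_day, theta and cell are made up for the example, and sort_index() is an optional extra step that keeps lookups on the first level tidy.

```python
import pandas as pd

# Three rows in the flattened shape built above (a subset of the data).
container = [
    {"date": "2021-06-22", "key": "ALPHA", "BBB": 0, "EEE": 1, "BTB": 1, "YUY": 1},
    {"date": "2021-09-07", "key": "ALPHA", "BBB": 0, "EEE": 0, "BTB": 0, "YUY": 0},
    {"date": "2021-06-22", "key": "THETA", "BBB": 0, "EEE": 1, "BTB": 1, "YUY": 0},
]
df = pd.DataFrame(container).set_index(["date", "key"]).sort_index()

# All rows for one date; the result is indexed by key.
one_day = df.loc["2021-06-22"]

# All rows for one key; the result is indexed by date.
theta = df.xs("THETA", level="key")

# A single cell, addressed by (date, key) and column name.
cell = df.loc[("2021-06-22", "ALPHA"), "EEE"]

print(one_day)
print(theta)
print(cell)
```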