I would like to convert a JSON to Pandas dataframe.
My JSON looks like: like:
{
"country1":{
"AdUnit1":{
"floor_price1":{
"feature1":1111,
"feature2":1112
},
"floor_price2":{
"feature1":1121
}
},
"AdUnit2":{
"floor_price1":{
"feature1":1211
},
"floor_price2":{
"feature1":1221
}
}
},
"country2":{
"AdUnit1":{
"floor_price1":{
"feature1":2111,
"feature2":2112
}
}
}
}
I read the file from GCP using this code:
project = Context.default().project_id
sample_bucket_name = 'my_bucket'
sample_bucket_path = 'gs://' + sample_bucket_name
print('Object: ' + sample_bucket_path + '/json_output.json')
sample_bucket = storage.Bucket(sample_bucket_name)
sample_bucket.create()
sample_bucket.exists()
sample_object = sample_bucket.object('json_output.json')
list(sample_bucket.objects())
json = sample_object.read_stream()
My goal to get Pandas dataframe which looks like:
I tried using json_normalize, but didn't succeed.
You can convert JSON to pandas DataFrame by using json_normalize() , read_json() and from_dict() functions. Some of these methods are also used to extract data from JSON files and store them as DataFrame. JSON stands for JavaScript object notation . JSON is used for sharing data between servers and web applications.
pandas read_json() function can be used to read JSON file or string into DataFrame. It supports JSON in several formats by using orient param. JSON is shorthand for JavaScript Object Notation which is the most used file format that is used to exchange data between two systems or web applications.
Parse JSON - Convert from JSON to Python If you have a JSON string, you can parse it by using the json.loads() method. The result will be a Python dictionary.
Nested JSONs are always quite tricky to handle correctly.
A few months ago, I figured out a way to provide an "universal answer" using the beautifully written flatten_json_iterative_solution from here: which unpacks iteratively each level of a given json.
Then one can simply transform it to a Pandas.Series then Pandas.DataFrame like so:
df = pd.Series(flatten_json_iterative_solution(dict(json_))).to_frame().reset_index()
Intermediate Dataframe result
Some data transformation can easily be performed to split the index in the columns names you asked for:
df[["index", "col1", "col2", "col3", "col4"]] = df['index'].apply(lambda x: pd.Series(x.split('_')))
Final result
You could use this:
def flatten_dict(d):
""" Returns list of lists from given dictionary """
l = []
for k, v in sorted(d.items()):
if isinstance(v, dict):
flatten_v = flatten_dict(v)
for my_l in reversed(flatten_v):
my_l.insert(0, k)
l.extend(flatten_v)
elif isinstance(v, list):
for l_val in v:
l.append([k, l_val])
else:
l.append([k, v])
return l
This function receives a dictionary (including nesting where values could also be lists) and flattens it to a list of lists.
Then, you can simply:
df = pd.DataFrame(flatten_dict(my_dict))
Where my_dict
is your JSON object.
Taking your example, what you get when you run print(df)
is:
0 1 2 3 4
0 country1 AdUnit1 floor_price1 feature1 1111
1 country1 AdUnit1 floor_price1 feature2 1112
2 country1 AdUnit1 floor_price2 feature1 1121
3 country1 AdUnit2 floor_price1 feature1 1211
4 country1 AdUnit2 floor_price2 feature1 1221
5 country2 AdUnit1 floor_price1 feature1 2111
6 country2 AdUnit1 floor_price1 feature2 2112
And when you create the dataframe, you can name your columns and index
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With