Suppose I have a nested dictionary 'user_dict' with structure:
For example, an entry of this dictionary would be:
user_dict[12] = {
"Category 1": {"att_1": 1,
"att_2": "whatever"},
"Category 2": {"att_1": 23,
"att_2": "another"}}
each item in user_dict
has the same structure and user_dict
contains a large number of items which I want to feed to a pandas DataFrame, constructing the series from the attributes. In this case a hierarchical index would be useful for the purpose.
Specifically, my question is whether there exists a way to to help the DataFrame constructor understand that the series should be built from the values of the "level 3" in the dictionary?
If I try something like:
df = pandas.DataFrame(users_summary)
The items in "level 1" (the UserId's) are taken as columns, which is the opposite of what I want to achieve (have UserId's as index).
I know I could construct the series after iterating over the dictionary entries, but if there is a more direct way this would be very useful. A similar question would be asking whether it is possible to construct a pandas DataFrame from json objects listed in a file.
We first take the list of nested dictionary and extract the rows of data from it. Then we create another for loop to append the rows into the new list which was originally created empty. Finally we apply the DataFrames function in the pandas library to create the Data Frame.
You can convert a dictionary to Pandas Dataframe using df = pd. DataFrame. from_dict(my_dict) statement.
Pandas Series is a 1-dimensional array like object which can hold data of any type. You can create a pandas series from a dictionary by passing the dictionary to the command: pandas. Series() .
A pandas MultiIndex consists of a list of tuples. So the most natural approach would be to reshape your input dict so that its keys are tuples corresponding to the multi-index values you require. Then you can just construct your dataframe using pd.DataFrame.from_dict
, using the option orient='index'
:
user_dict = {12: {'Category 1': {'att_1': 1, 'att_2': 'whatever'},
'Category 2': {'att_1': 23, 'att_2': 'another'}},
15: {'Category 1': {'att_1': 10, 'att_2': 'foo'},
'Category 2': {'att_1': 30, 'att_2': 'bar'}}}
pd.DataFrame.from_dict({(i,j): user_dict[i][j]
for i in user_dict.keys()
for j in user_dict[i].keys()},
orient='index')
att_1 att_2
12 Category 1 1 whatever
Category 2 23 another
15 Category 1 10 foo
Category 2 30 bar
An alternative approach would be to build your dataframe up by concatenating the component dataframes:
user_ids = []
frames = []
for user_id, d in user_dict.iteritems():
user_ids.append(user_id)
frames.append(pd.DataFrame.from_dict(d, orient='index'))
pd.concat(frames, keys=user_ids)
att_1 att_2
12 Category 1 1 whatever
Category 2 23 another
15 Category 1 10 foo
Category 2 30 bar
pd.concat
accepts a dictionary. With this in mind, it is possible to improve upon the currently accepted answer in terms of simplicity and performance by use a dictionary comprehension to build a dictionary mapping keys to sub-frames.
pd.concat({k: pd.DataFrame(v).T for k, v in user_dict.items()}, axis=0)
Or,
pd.concat({
k: pd.DataFrame.from_dict(v, 'index') for k, v in user_dict.items()
},
axis=0)
att_1 att_2
12 Category 1 1 whatever
Category 2 23 another
15 Category 1 10 foo
Category 2 30 bar
So I used to use a for loop for iterating through the dictionary as well, but one thing I've found that works much faster is to convert to a panel and then to a dataframe. Say you have a dictionary d
import pandas as pd
d
{'RAY Index': {datetime.date(2014, 11, 3): {'PX_LAST': 1199.46,
'PX_OPEN': 1200.14},
datetime.date(2014, 11, 4): {'PX_LAST': 1195.323, 'PX_OPEN': 1197.69},
datetime.date(2014, 11, 5): {'PX_LAST': 1200.936, 'PX_OPEN': 1195.32},
datetime.date(2014, 11, 6): {'PX_LAST': 1206.061, 'PX_OPEN': 1200.62}},
'SPX Index': {datetime.date(2014, 11, 3): {'PX_LAST': 2017.81,
'PX_OPEN': 2018.21},
datetime.date(2014, 11, 4): {'PX_LAST': 2012.1, 'PX_OPEN': 2015.81},
datetime.date(2014, 11, 5): {'PX_LAST': 2023.57, 'PX_OPEN': 2015.29},
datetime.date(2014, 11, 6): {'PX_LAST': 2031.21, 'PX_OPEN': 2023.33}}}
The command
pd.Panel(d)
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 4 (minor_axis)
Items axis: RAY Index to SPX Index
Major_axis axis: PX_LAST to PX_OPEN
Minor_axis axis: 2014-11-03 to 2014-11-06
where pd.Panel(d)[item] yields a dataframe
pd.Panel(d)['SPX Index']
2014-11-03 2014-11-04 2014-11-05 2014-11-06
PX_LAST 2017.81 2012.10 2023.57 2031.21
PX_OPEN 2018.21 2015.81 2015.29 2023.33
You can then hit the command to_frame() to turn it into a dataframe. I use reset_index as well to turn the major and minor axis into columns rather than have them as indices.
pd.Panel(d).to_frame().reset_index()
major minor RAY Index SPX Index
PX_LAST 2014-11-03 1199.460 2017.81
PX_LAST 2014-11-04 1195.323 2012.10
PX_LAST 2014-11-05 1200.936 2023.57
PX_LAST 2014-11-06 1206.061 2031.21
PX_OPEN 2014-11-03 1200.140 2018.21
PX_OPEN 2014-11-04 1197.690 2015.81
PX_OPEN 2014-11-05 1195.320 2015.29
PX_OPEN 2014-11-06 1200.620 2023.33
Finally, if you don't like the way the frame looks you can use the transpose function of panel to change the appearance before calling to_frame() see documentation here http://pandas.pydata.org/pandas-docs/dev/generated/pandas.Panel.transpose.html
Just as an example
pd.Panel(d).transpose(2,0,1).to_frame().reset_index()
major minor 2014-11-03 2014-11-04 2014-11-05 2014-11-06
RAY Index PX_LAST 1199.46 1195.323 1200.936 1206.061
RAY Index PX_OPEN 1200.14 1197.690 1195.320 1200.620
SPX Index PX_LAST 2017.81 2012.100 2023.570 2031.210
SPX Index PX_OPEN 2018.21 2015.810 2015.290 2023.330
Hope this helps.
This solution should work for arbitrary depth by flattening dictionary keys to a tuple chain
def flatten_dict(nested_dict):
res = {}
if isinstance(nested_dict, dict):
for k in nested_dict:
flattened_dict = flatten_dict(nested_dict[k])
for key, val in flattened_dict.items():
key = list(key)
key.insert(0, k)
res[tuple(key)] = val
else:
res[()] = nested_dict
return res
def nested_dict_to_df(values_dict):
flat_dict = flatten_dict(values_dict)
df = pd.DataFrame.from_dict(flat_dict, orient="index")
df.index = pd.MultiIndex.from_tuples(df.index)
df = df.unstack(level=-1)
df.columns = df.columns.map("{0[1]}".format)
return df
In case someone wants to get the data frame in a "long format" (leaf values have the same type) without multiindex, you can do this:
pd.DataFrame.from_records(
[
(level1, level2, level3, leaf)
for level1, level2_dict in user_dict.items()
for level2, level3_dict in level2_dict.items()
for level3, leaf in level3_dict.items()
],
columns=['UserId', 'Category', 'Attribute', 'value']
)
UserId Category Attribute value
0 12 Category 1 att_1 1
1 12 Category 1 att_2 whatever
2 12 Category 2 att_1 23
3 12 Category 2 att_2 another
4 15 Category 1 att_1 10
5 15 Category 1 att_2 foo
6 15 Category 2 att_1 30
7 15 Category 2 att_2 bar
(I know the original question probably wants (I.) to have Levels 1 and 2 as multiindex and Level 3 as columns and (II.) asks about other ways than iteration over values in the dict. But I hope this answer is still relevant and useful (I.): to people like me who have tried to find a way to get the nested dict into this shape and google only returns this question and (II.): because other answers involve some iteration as well and I find this approach flexible and easy to read; not sure about performance, though.)
For other ways to represent the data, you don't need to do much. For example, if you just want the "outer" key to be an index, the "inner" key to be columns and the values to be cell values, this would do the trick:
df = pd.DataFrame.from_dict(user_dict, orient='index')
Building on verified answer, for me this worked best:
ab = pd.concat({k: pd.DataFrame(v).T for k, v in data.items()}, axis=0)
ab.T
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With