I have a nested dictionary in which the sub-dictionaries hold lists:
nested_dict = {'string1': {69: [1231, 232], 67:[682, 12], 65: [1, 1]}, 
    'string2': {28672: [82, 23], 22736: [82, 93, 1102, 102], 19423: [64, 23]}, ... }
There are at least two elements in the list for the sub-dictionaries, but there could be more.
I would like to "unfold" this dictionary into a pandas DataFrame, with one column for the outer dictionary keys (e.g. 'string1', 'string2', ...), one column for the sub-dictionary keys, one column for the first item in the list, one column for the next item, and so on.
Here is what the output should look like:
col1       col2    col3     col4    col5    col6
string1    69      1231     232
string1    67      682      12
string1    65      1        1
string2    28672   82       23
string2    22736   82       93      1102    102
string2    19423   64       23
Naturally, I tried to use pd.DataFrame.from_dict:
new_df = pd.DataFrame.from_dict({(i,j): nested_dict[i][j] 
                           for i in nested_dict.keys() 
                           for j in nested_dict[i].keys()
                           ... 
Now I'm stuck, and there are several problems:
How do I parse the lists (i.e. the nested_dict[i].values()) such that each element becomes a new pandas DataFrame column?
The above will not actually create a column for each field.
The above will not fill up the columns with elements, e.g. string1 should appear in each row for its sub-dictionary key-value pairs. (For col5 and col6, I can fill the NA with zeros.)
I'm not sure how to name these columns correctly.
This should give you the result you are looking for, although it's probably not the most elegant solution. There is likely a more idiomatic pandas way to do it.
I parsed your nested dict and built a list of dictionaries (one for each row).
import pandas as pd

# some sample input
nested_dict = {
    'string1': {69: [1231, 232], 67: [682, 12], 65: [1, 1]},
    'string2': {28672: [82, 23], 22736: [82, 93, 1102, 102], 19423: [64, 23]},
    'string3': {28673: [83, 24], 22737: [83, 94, 1103, 103], 19424: [65, 24]}
}
# new_list holds one dict per output row
new_list = []
for k1 in nested_dict:
    curr_dict = nested_dict[k1]
    for k2 in curr_dict:
        new_dict = {'col1': k1, 'col2': k2}
        # the list items become col3, col4, ... for this row
        new_dict.update({'col%d' % (i + 3): v for i, v in enumerate(curr_dict[k2])})
        new_list.append(new_dict)
# create a DataFrame from new_list
df = pd.DataFrame(new_list)
The output:
      col1   col2  col3  col4    col5   col6
0  string2  28672    82    23     NaN    NaN
1  string2  22736    82    93  1102.0  102.0
2  string2  19423    64    23     NaN    NaN
3  string3  19424    65    24     NaN    NaN
4  string3  28673    83    24     NaN    NaN
5  string3  22737    83    94  1103.0  103.0
6  string1     65     1     1     NaN    NaN
7  string1     67   682    12     NaN    NaN
8  string1     69  1231   232     NaN    NaN
There is an assumption that the input will always contain enough data to create a col1 and a col2.
I loop through nested_dict, assuming that each of its values is also a dictionary, and then loop through that inner dictionary (curr_dict). The keys k1 and k2 populate col1 and col2. For the remaining columns, I iterate through the list contents and add one column per element.
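The question mentions that the missing values in col5 and col6 can be filled with zeros. A minimal sketch of that follow-up step, assuming the same row-building loop and that every list item is an integer (the names rows and value_cols are only illustrative):

```python
import pandas as pd

nested_dict = {
    'string1': {69: [1231, 232], 67: [682, 12], 65: [1, 1]},
    'string2': {28672: [82, 23], 22736: [82, 93, 1102, 102], 19423: [64, 23]},
}

# Build one dict per row, exactly as in the loop described above.
rows = []
for k1, curr_dict in nested_dict.items():
    for k2, values in curr_dict.items():
        row = {'col1': k1, 'col2': k2}
        # enumerate(..., start=3) numbers the list items col3, col4, ...
        row.update({'col%d' % i: v for i, v in enumerate(values, start=3)})
        rows.append(row)

df = pd.DataFrame(rows)

# Shorter lists leave NaN in the trailing columns, which makes them float.
# Assumption: zero is an acceptable placeholder, as the question suggests.
value_cols = [c for c in df.columns if c not in ('col1', 'col2')]
df = df.fillna(0).astype({c: int for c in value_cols})
```

After this step col3 onward are plain integer columns again. If zero could be confused with a real value, leaving the NaN in place may be the safer choice.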
Here's a method which uses a recursive generator to unroll the nested dictionaries. It won't assume that you have exactly two levels, but continues unrolling each dict until it hits a list.
import pandas as pd

nested_dict = {
    'string1': {69: [1231, 232], 67: [682, 12], 65: [1, 1]},
    'string2': {28672: [82, 23], 22736: [82, 93, 1102, 102], 19423: [64, 23]},
    'string3': [101, 102]}

def unroll(data):
    if isinstance(data, dict):
        for key, value in data.items():
            # Recursively unroll the next level and prepend the key to each row.
            for row in unroll(value):
                yield [key] + row
    elif isinstance(data, list):
        # This is the bottom of the structure (defines exactly one row).
        yield data

df = pd.DataFrame(list(unroll(nested_dict)))
Because unroll produces a list of lists rather than dicts, the columns will be named numerically (from 0 to 5 in this case). So you need to use rename to get the column labels you want:
df.rename(columns=lambda i: 'col{}'.format(i+1))
This returns the following result (note that the additional string3 entry is also unrolled).
      col1   col2  col3   col4    col5   col6
0  string1     69  1231  232.0     NaN    NaN
1  string1     67   682   12.0     NaN    NaN
2  string1     65     1    1.0     NaN    NaN
3  string2  28672    82   23.0     NaN    NaN
4  string2  22736    82   93.0  1102.0  102.0
5  string2  19423    64   23.0     NaN    NaN
6  string3    101   102    NaN     NaN    NaN
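Two details in the result above are easy to miss: rename returns a new DataFrame rather than modifying df, and col4 shows up as float because the string3 row has no value there. A hedged sketch of one way to handle both, assuming a pandas version that provides the nullable 'Int64' dtype (1.0 or later) and that all list items are integers:

```python
import pandas as pd

def unroll(data):
    # Same generator as in the answer: walk dicts until a list is reached.
    if isinstance(data, dict):
        for key, value in data.items():
            for row in unroll(value):
                yield [key] + row
    elif isinstance(data, list):
        yield data

nested_dict = {
    'string1': {69: [1231, 232], 67: [682, 12], 65: [1, 1]},
    'string2': {28672: [82, 23], 22736: [82, 93, 1102, 102], 19423: [64, 23]},
    'string3': [101, 102]}

df = pd.DataFrame(list(unroll(nested_dict)))

# rename returns a new DataFrame, so keep the result.
df = df.rename(columns=lambda i: 'col{}'.format(i + 1))

# Cast every column except col1 to the nullable integer dtype, so missing
# entries stay missing (<NA>) without turning the column into floats.
df = df.astype({c: 'Int64' for c in df.columns[1:]})
```

Note that the string3 row is shorter, so its list values land in col2 and col3 rather than col3 and col4.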