
Creating complex nested dictionaries from Pandas DataFrame

I'm trying to find a generic way of creating (possibly deeply) nested dictionaries from a flat Pandas DataFrame instance.

Suppose I have the following DataFrame:

import pandas as pd

dat = pd.DataFrame({'name' : ['John', 'John', 'John', 'John', 'Henry', 'Henry'],
                    'age' : [24, 24, 24, 24, 31, 31],
                    'gender' : ['Male','Male','Male','Male','Male','Male'],
                    'study' : ['Mathematics', 'Mathematics', 'Mathematics', 'Philosophy', 'Physics', 'Physics'],
                    'course' : ['Calculus 101', 'Calculus 101', 'Calculus 102', 'Aristotelean Ethics', 'Quantum mechanics', 'Quantum mechanics'],
                    'test' : ['Exam', 'Essay','Exam','Essay', 'Exam1','Exam2'],
                    'pass' : [True, True, True, True, True, True],
                    'grade' : ['A', 'A', 'B', 'A', 'C', 'C']})
dat = dat[['name', 'age', 'gender', 'study', 'course', 'test', 'grade', 'pass']]  # re-order columns to better reflect data structure

I want to create a deeply nested dictionary (or list of nested dictionaries) that 'respects' the underlying structure of this data. That is, a grade is information about a test, which is part of a course, which is part of a study that a person does. Age and gender are information about that same person.
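To make the "information about the person" part concrete, here is a minimal sketch that checks which columns are constant within each name. It uses a trimmed-down copy of the frame; the names `people`, `distinct_per_person` and `is_person_level` are only illustrative.

```python
import pandas as pd

# Trimmed-down copy of the example frame (only the columns needed here).
people = pd.DataFrame({'name': ['John', 'John', 'Henry', 'Henry'],
                       'age': [24, 24, 31, 31],
                       'gender': ['Male', 'Male', 'Male', 'Male'],
                       'test': ['Exam', 'Essay', 'Exam1', 'Exam2']})

# Count the distinct values of each candidate attribute within each person.
distinct_per_person = people.groupby('name')[['age', 'gender']].nunique()

# True for a column when every person has exactly one value for it,
# i.e. the column describes the person rather than a single test.
is_person_level = (distinct_per_person == 1).all()
print(is_person_level)
```

Columns that pass this check ('age' and 'gender' here) are the ones that should sit directly under 'name' instead of forming their own nesting level.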

An example desired output is this:

[{'John': {'age': 24,
           'gender': 'Male',
           'study': {'Mathematics': {'Calculus 101': {'Essay': {'grade': 'A',
                                                                'pass': True},
                                                      'Exam': {'grade': 'A',
                                                               'pass': True}},
                                     'Calculus 102': {'Exam': {'grade': 'B',
                                                               'pass': True}}},
                     'Philosophy': {'Aristotelean Ethics': {'Essay': {'grade': 'A',
                                                                      'pass': True}}}}}},
 {'Henry': {'age': 31,
            'gender': 'Male',
            'study': {'Physics': {'Quantum mechanics': {'Exam1': {'grade': 'C',
                                                                  'pass': True},
                                                        'Exam2': {'grade': 'C',
                                                                  'pass': True}}}}}}]

(although there may be other, similar ways to structure such data).

I tried using groupby, which makes it easy to nest 'grade' and 'pass' under 'test', 'test' under 'course', 'course' under 'study', and 'study' under 'name'. But I don't see how to also add 'gender' and 'age' under 'name'. This is the best I came up with:

dic = {}
for ind, group in dat.groupby(['name', 'study', 'course', 'test'])[['grade', 'pass']]:

    # this is ugly and not very generic, but just as an example
    # ind is the (name, study, course, test) tuple; group holds the matching rows
    if ind[0] not in dic:
        dic[ind[0]] = {}
    if ind[1] not in dic[ind[0]]:
        dic[ind[0]][ind[1]] = {}
    if ind[2] not in dic[ind[0]][ind[1]]:
        dic[ind[0]][ind[1]][ind[2]] = {}
    if ind[3] not in dic[ind[0]][ind[1]][ind[2]]:
        dic[ind[0]][ind[1]][ind[2]][ind[3]] = {}

    dic[ind[0]][ind[1]][ind[2]][ind[3]]['grade'] = group['grade'].values[0]
    dic[ind[0]][ind[1]][ind[2]][ind[3]]['pass'] = group['pass'].values[0]

But in this case, 'age' and 'gender' are not nested under 'name'. I can't seem to wrap my head around how to do this...

Another option is to set a MultiIndex and make a .to_dict('index') call. But then again, I don't see how I can nest both dicts and non-dicts under a single key...
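For illustration, this is roughly what the MultiIndex route produces. It is a sketch on a reduced frame; `dat_small` and `flat` are illustrative names, and the index columns are chosen so that the index stays unique, which `to_dict('index')` requires.

```python
import pandas as pd

# Reduced version of the example frame with a unique (name, study, test) index.
dat_small = pd.DataFrame({'name': ['John', 'John'],
                          'age': [24, 24],
                          'study': ['Mathematics', 'Philosophy'],
                          'test': ['Exam', 'Essay'],
                          'grade': ['A', 'A']})

# Every MultiIndex entry becomes one tuple key; the result is flat, not nested.
flat = dat_small.set_index(['name', 'study', 'test']).to_dict('index')
print(flat)
```

The keys are tuples such as ('John', 'Mathematics', 'Exam'), and 'age' is repeated in every leaf instead of appearing once under 'John', which is exactly the part I can't get right.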

My question is similar to this one: Convert pandas DataFrame to a nested dict, but I'm looking for a more complex nesting (i.e., not just one last column nested under all the other columns). Most other questions on Stack Overflow ask for the reverse: creating a (possibly MultiIndex) DataFrame from a deeply nested dictionary.

Edit: The question is also similar to this one: Pandas convert Dataframe to Nested Json, but in that question only the last column (e.g., column n) should be nested under all other columns (n-1, n-2, etc.; fully recursive nesting). In my question, columns n and n-1 should be nested under n-2, but columns n-2 and n-3 should be nested under n-4 (thus, importantly, n-2 is not nested under n-3 but under n-4). The MultiIndex partial solution offered by Mohammad Yusuf Ghazi depicts the structure nicely.

SMOP asked Oct 29 '22 15:10


1 Answer

Not really concise, but it's the best I can get now:

>>> import pprint
>>> def rollup1(x):
...     return x.set_index('test')[['grade', 'pass']].to_dict(orient='index')
>>> def rollup2(x):
...     return x.groupby('course').apply(rollup1).to_dict()
>>> def rollup3(x):
...     return x.groupby('study').apply(rollup2).to_dict()

>>> df = dat.groupby(['name','age','gender']).apply(rollup3)
>>> df.name = 'study'
>>> res = df.reset_index(level=[1,2]).to_dict(orient='index')
>>> pprint.pprint(res)
{'Henry': {'age': 31,
           'gender': 'Male',
           'study': {'Physics': {'Quantum mechanics': {'Exam1': {'grade': 'C',
                                                                 'pass': True},
                                                       'Exam2': {'grade': 'C',
                                                                 'pass': True}}}}},
 'John': {'age': 24,
          'gender': 'Male',
          'study': {'Mathematics': {'Calculus 101': {'Essay': {'grade': 'A',
                                                               'pass': True},
                                                     'Exam': {'grade': 'A',
                                                              'pass': True}},
                                    'Calculus 102': {'Exam': {'grade': 'B',
                                                              'pass': True}}},
                    'Philosophy': {'Aristotelean Ethics': {'Essay': {'grade': 'A',
                                                                     'pass': True}}}}}}

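The idea is to roll the data up into dictionaries one grouping level at a time, from the innermost level ('test') outwards, until a single 'study' column of nested dictionaries remains next to 'age' and 'gender'.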
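As a small illustration of a single roll-up step, here is a sketch on just the two 'Calculus 101' rows. The names `one_course`, `leaf` and `by_course` are illustrative, and the second step uses an explicit loop over the groups rather than `groupby(...).apply(...)` to keep the sketch short.

```python
import pandas as pd

# The two 'Calculus 101' rows from the example frame.
one_course = pd.DataFrame({'course': ['Calculus 101', 'Calculus 101'],
                           'test': ['Exam', 'Essay'],
                           'grade': ['A', 'A'],
                           'pass': [True, True]})

# Innermost step: index by 'test', keep the value columns, one dict per row.
leaf = one_course.set_index('test')[['grade', 'pass']].to_dict(orient='index')

# One level up: repeat the same step for each course and key the results by course.
by_course = {course: grp.set_index('test')[['grade', 'pass']].to_dict(orient='index')
             for course, grp in one_course.groupby('course')}
print(by_course)
```

Each outer level wraps the dictionaries produced by the level below it, which is what rollup2 and rollup3 do for 'course' and 'study'.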

Update: I've tried to create a more generic solution, so that it works for questions like this one as well:

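def rollup_to_dict_core(x, values, columns, d_columns=None):
    # x: DataFrame (or one group of it)
    # values: leaf value columns; columns: nesting levels, outermost first
    # d_columns: extra columns stored next to the nested dict at this level
    if d_columns is None:
        d_columns = []

    if len(columns) == 1:
        # innermost level: map the last key column to its value(s)
        if len(values) == 1:
            return x.set_index(columns)[values[0]].to_dict()
        else:
            return x.set_index(columns)[values].to_dict(orient='index')
    else:
        # recurse into each group; d_columns are only used at the top level
        res = x.groupby([columns[0]] + d_columns).apply(lambda y: rollup_to_dict_core(y, values, columns[1:]))
        if len(d_columns) == 0:
            return res.to_dict()
        else:
            # name the nested dicts after the next level, then move d_columns
            # from the index back into regular columns
            res.name = columns[1]
            # a list keeps this working on Python 3, where range() is lazy
            res = res.reset_index(level=list(range(1, len(d_columns) + 1)))
            return res.to_dict(orient='index')

def rollup_to_dict(x, values, d_columns=None):
    if d_columns is None:
        d_columns = []

    # every column that is neither a value nor a d_column is a nesting level,
    # in the order it appears in the DataFrame
    columns = [c for c in x.columns if c not in values and c not in d_columns]
    return rollup_to_dict_core(x, values, columns, d_columns)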

>>> pprint.pprint(rollup_to_dict(dat, ['pass', 'grade'], ['age','gender']))
{'Henry': {'age': 31,
           'gender': 'Male',
           'study': {'Physics': {'Quantum mechanics': {'Exam1': {'grade': 'C',
                                                                 'pass': True},
                                                       'Exam2': {'grade': 'C',
                                                                 'pass': True}}}}},
 'John': {'age': 24,
          'gender': 'Male',
          'study': {'Mathematics': {'Calculus 101': {'Essay': {'grade': 'A',
                                                               'pass': True},
                                                     'Exam': {'grade': 'A',
                                                              'pass': True}},
                                    'Calculus 102': {'Exam': {'grade': 'B',
                                                              'pass': True}}},
                    'Philosophy': {'Aristotelean Ethics': {'Essay': {'grade': 'A',
                                                                     'pass': True}}}}}}
Roman Pekar answered Nov 09 '22 06:11