Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Convert dict constructor to Pandas MultiIndex dataframe

I have a lot of data that I'd like to structure in a Pandas dataframe. However, I need a multi-index format for this. The Pandas MultiIndex feature has always confused me and also this time I can't get my head around it.

I built the structure as I want it as a dict, but because my actual data is much larger, I want to use Pandas instead. The code below is the dict variant. Note that the original data has a lot more labels and more rows as well.

The idea is that the original data contains rows of a task with index Task_n that has been performed by a participant with index Participant_n. Each row is a segment. Even though the original data does not have this distinction, I want to add this to my dataframe. In other words:

Participant_n | Task_n | val | dur
----------------------------------
            1 |      1 |  12 |   2
            1 |      1 |   3 |   4
            1 |      1 |   4 |  12
            1 |      2 |  11 |  11
            1 |      2 |  34 |   4

The above example contains one participants, two tasks, with respectively three and two segments (rows).

In Python, with a dict structure this looks like this:

import pandas as pd

cols = ['Participant_n', 'Task_n', 'val', 'dur']

data = [[1,1,25,83],
        [1,1,4,68],
        [1,1,9,987],
        [1,2,98,98],
        [1,2,84,4],
        [2,1,9,21],
        [2,2,15,6],
        [2,2,185,6],
        [2,2,18,4],
        [2,3,8,12],
        [3,1,7,78],
        [3,1,12,88],
        [3,2,12,48]]

d = pd.DataFrame(data, columns=cols)

part_d = {}
for row in d.itertuples():
    participant_n = row.Participant_n
    participant = "participant" + str(participant_n)
    task = "task" + str(row.Task_n)

    if participant in part_d:
        part_d[participant]['all_sum']['val'] += int(row.val)
        part_d[participant]['all_sum']['dur'] += int(row.dur)
    else:
        part_d[participant] = {
            'prof': 0 if participant_n < 20 else 1,
            'all_sum': {
                'val': int(row.val),
                'dur': int(row.dur),
            }
        }

    if task in part_d[participant]:
        # Get already existing keys
        k = list(part_d[participant][task].keys())

        k_int = []
        # Only get the ints (i.e. not all_sum etc.)
        for n in k:
            # Get digit from e.g. seg1
            n = n[3:]
            try:
                k_int.append(int(n))
            except ValueError:
                pass

        # Increment max by 1
        i = max(k_int) + 1
        part_d[participant][task][f"seg{i}"] = {
            'val': int(row.val),
            'dur': int(row.dur),
        }
        part_d[participant][task]['task_sum']['val'] += int(row.val)
        part_d[participant][task]['task_sum']['dur'] += int(row.dur)
    else:
        part_d[participant][task] = {
            'seg1': {
                'val': int(row.val),
                'dur': int(row.dur),
            },
            'task_sum': {
                'val': int(row.val),
                'dur': int(row.dur),
            }
        }

print(part_d)

In the end result here I have some additional variables such as: task_sum (the sum over the task of a participant), all_sum (sum of all a participant's actions), and also prof which is an arbitrary boolean flag. The resulting dict looks like this (not beautified to save space. If you want to inspect, open in text editor as JSON or Python dict and beautify):

{'participant1': {'prof': 0, 'all_sum': {'val': 220, 'dur': 1240}, 'task1': {'seg1': {'val': 25, 'dur': 83}, 'task_sum': {'val': 38, 'dur': 1138}, 'seg2': {'val': 4, 'dur': 68}, 'seg3': {'val': 9, 'dur': 987}}, 'task2': {'seg1': {'val': 98, 'dur': 98}, 'task_sum': {'val': 182, 'dur': 102}, 'seg2': {'val': 84, 'dur': 4}}}, 'participant2': {'prof': 0, 'all_sum': {'val': 235, 'dur': 49}, 'task1': {'seg1': {'val': 9, 'dur': 21}, 'task_sum': {'val': 9, 'dur': 21}}, 'task2': {'seg1': {'val': 15, 'dur': 6}, 'task_sum': {'val': 218, 'dur': 16}, 'seg2': {'val': 185, 'dur': 6}, 'seg3': {'val': 18, 'dur': 4}}, 'task3': {'seg1': {'val': 8, 'dur': 12}, 'task_sum': {'val': 8, 'dur': 12}}}, 'participant3': {'prof': 0, 'all_sum': {'val': 31, 'dur': 214}, 'task1': {'seg1': {'val': 7, 'dur': 78}, 'task_sum': {'val': 19, 'dur': 166}, 'seg2': {'val': 12, 'dur': 88}}, 'task2': {'seg1': {'val': 12, 'dur': 48}, 'task_sum': {'val': 12, 'dur': 48}}}}

Instead of a dictionary, I would like this to end up in a pd.DataFrame with multiple indexes that looks like the representation below, or similar. (For simplicity's sake, instead of task1 or seg1 I just used the indices.)

Participant   Prof all_sum      Task    Task_sum     Seg   val   dur
                   val    dur           val    dur
====================================================================
participant1  0    220   1240      1     38   1138     1    25    83
                                                       2     4    68
                                                       3     9   987
                                   2    182    102     1    98    98
                                                       2    84     4
--------------------------------------------------------------------
participant2  0    235     49      1      9     21     1     9    21
                                   2    218     16     1    15     6
                                                       2   185     6
                                                       3    18     4
                                   3      8     12     1     8    12
--------------------------------------------------------------------
participant3  0     31    214      1     19    166     1     7    78
                                                       2    12    88
                                   2     12     48     1    12    48

Is this a structure that is possible in Pandas? If not, which reasonable alternatives are?

Again I have to emphasise that in reality there is a lot more data and possibly more sub-levels. The solution thus has to be flexible, and efficient. If it makes things a lot simpler, I am willing to only have multi-index on one axis, and change the header to:

Participant  Prof  all_sum_val  all_sum_dur  Task  Task_sum_val  Task_sum_dur  Seg   

The main issue I am having is that I do not understand how I can build a multi index df if I don't know the dimensions in advance. I don't know in advance how many tasks or segments there will be. So I am pretty sure I can keep the loop construct from my initial dict approach and I guess I'd then have to append/concat to an initial empty DataFrame, but the question is then what the structure has to look like. It can't be a simple Series, because that does not take multi index in account. So how?

For the people who have read this far and want to try their hand at this, I think that my original code can be re-used for the most part (loop and variable assignment), but instead of a dict it have to be accessors to the DataFrame. That an import aspect: data should be easily readable with getters/setters, just as a regular DataFrame is. E.g. it should be easy to get the duration value for participant two, task 2, segment 2, and so on. But also, getting a subset of the data (e.g. where prof === 0) should be without problems.

like image 614
Bram Vanroy Avatar asked Mar 06 '18 13:03

Bram Vanroy


1 Answers

My only suggestion is to get rid of all your dictionary stuff. All of that code can be re-written in Pandas without much effort. This will likely speed up the transformation process as well but will take some time. To help you in the process I have rewritten the section you provided. The rest is up to you.

import pandas as pd

cols = ['Participant_n', 'Task_n', 'val', 'dur']

data = [[1,1,25,83],
        [1,1,4,68],
        [1,1,9,987],
        [1,2,98,98],
        [1,2,84,4],
        [2,1,9,21],
        [2,2,15,6],
        [2,2,185,6],
        [2,2,18,4],
        [2,3,8,12],
        [3,1,7,78],
        [3,1,12,88],
        [3,2,12,48]]

df = pd.DataFrame(data, columns=cols)
df["Task Sum val"] = df.groupby(["Participant_n","Task_n"])["val"].transform("sum")
df["Task Sum dur"] = df.groupby(["Participant_n","Task_n"])["dur"].transform("sum")
df["seg"] =df.groupby(["Participant_n","Task_n"]).cumcount() + 1
df["All Sum val"] = df.groupby("Participant_n")["val"].transform("sum")
df["All Sum dur"] = df.groupby("Participant_n")["dur"].transform("sum")
df = df.set_index(["Participant_n","All Sum val","All Sum dur","Task_n","Task Sum val","Task Sum dur"])[["seg","val","dur"]]
df = df.sort_index()
df

Output

                                                                        seg  val  dur
Participant_n All Sum val All Sum dur Task_n Task Sum val Task Sum dur               
1             220         1240        1      38           1138            1   25   83
                                                          1138            2    4   68
                                                          1138            3    9  987
                                      2      182          102             1   98   98
                                                          102             2   84    4
2             235         49          1      9            21              1    9   21
                                      2      218          16              1   15    6
                                                          16              2  185    6
                                                          16              3   18    4
                                      3      8            12              1    8   12
3             31          214         1      19           166             1    7   78
                                                          166             2   12   88
                                      2      12           48              1   12   48

Try to run this code and let me know what you think. Comment with any questions.

like image 72
Gabriel A Avatar answered Oct 16 '22 19:10

Gabriel A