Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Add a column with a groupby on a hierarchical dataframe

I have a dataframe structured like this:

First     A                             B                         
Second  bar       baz       foo       bar       baz       foo     
Third   cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog
0         3    8    7    7    4    7    5    3    2    2    6    2
1         8    6    5    7    8    7    1    8    6    0    3    9
2         9    2    2    9    7    3    1    8    4    1    0    8
3         3    6    0    6    3    2    2    6    2    4    6    9
4         7    6    4    3    1    5    0    4    8    4    8    1

So there are three column levels. I want to add a new column on the second level where for each of the third levels a computation is performed, for example 'new' = 'foo' + 'bar'. So the resulting dataframe would look like:

First     A                                       B                                   
Second  bar       baz       foo       new       bar       baz       foo       new     
Third   cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog
0         3    8    7    7    4    7    7   15    5    3    2    2    6    2   11    5
1         8    6    5    7    8    7   16   13    1    8    6    0    3    9    4   17
2         9    2    2    9    7    3   16    5    1    8    4    1    0    8    1   16
3         3    6    0    6    3    2    6    8    2    6    2    4    6    9    8   15
4         7    6    4    3    1    5    8   11    0    4    8    4    8    1    8    5

I have found a workaround which is listed at the end of this post, but its not at all 'panda-style' and prone to errors. The apply or transform function on a group seems like the right way to go but after hours of trying I still do not succeed. I figured the correct way should be something like:

def func(data):

    fi = data.columns[0][0]
    th = data.columns[0][2]

    data[(fi,'new',th)] = data[(fi,'foo',th)] + data[(fi,'bar',th)]

    print data
    return data

print grouped.apply(func)

The new column is properly added within the function, but is not returned. Using the same function with transform would work if the 'new' column already exists in the df, but how do you add a new column at a specific level 'on the fly' or before grouping?

The code to generate the sample df is:

import pandas, itertools

first = ['A','B']
second = ['foo','bar','baz']
third = ['dog', 'cat']

tuples = []
for tup in itertools.product(first, second, third):
    tuples.append(tup)

columns = pandas.MultiIndex.from_tuples(tuples, names=['First','Second','Third'])

data = np.random.randint(0,10,(5, 12))
df = pandas.DataFrame(data, columns=columns)

And my workaround:

dfnew = None
grouped = df.groupby(by=None, level=[0,2], axis=1)

for name, group in grouped:
    newparam = group.xs('foo', axis=1, level=1) + group.xs('bar', axis=1, level=1)

    dftmp = group.join(pandas.DataFrame(np.array(newparam), columns=pandas.MultiIndex.from_tuples([(group.columns[0][0], 'new', group.columns[0][2])], names=['First','Second', 'Third'])))

    if dfnew is None:
        dfnew = dftmp
    else:
        dfnew = pandas.concat([dfnew, dftmp], axis=1)

print dfnew.sort_index(axis=1)

Which works, but creating a new dataframe for each group and 'manually' assigning the levels is a really bad practice.

So what is the proper way to do this? I found several posts dealing with similar questions, but all of these had only 1 level of columns, and that's exactly what I'm struggling with.

like image 657
Rutger Kassies Avatar asked Oct 18 '12 08:10

Rutger Kassies


People also ask

How do I get Groupby columns in pandas?

You can also reset_index() on your groupby result to get back a dataframe with the name column now accessible. If you perform an operation on a single column the return will be a series with multiindex and you can simply apply pd. DataFrame to it and then reset_index. Show activity on this post.

How do you get Groupby Multiindex pandas?

You can use the following basic syntax to use GroupBy on a pandas DataFrame with a multiindex: #calculate sum by level 0 and 1 of multiindex df. groupby(level=[0,1]). sum() #calculate count by level 0 and 1 of multiindex df.

Does Groupby object supports column indexing?

First look at GroupBy We did not tell GroupBy which column we wanted it to apply the aggregation function on, so it applied it to all the relevant columns and returned the output. But fortunately, GroupBy object supports column indexing just like a DataFrame!


1 Answers

There definitely is a weakness in the API here but I'm not sure off the top of my head to make it easier to do what you're doing. Here's one simple way around this, at least for your example:

In [20]: df
Out[20]: 
First     A                             B                         
Second  foo       bar       baz       foo       bar       baz     
Third   dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat
0         7    2    9    3    3    0    5    9    8    2    0    6
1         1    4    1    7    2    3    2    3    1    0    4    0
2         6    5    0    6    6    1    5    1    7    4    3    6
3         4    8    1    9    0    3    9    2    3    1    5    9
4         6    1    1    5    1    2    2    6    3    7    2    1

In [21]: rdf = df.stack(['First', 'Third'])

In [22]: rdf['new'] = rdf.foo + rdf.bar

In [23]: rdf
Out[23]: 
Second         bar  baz  foo  new
  First Third                    
0 A     cat      3    0    2    5
        dog      9    3    7   16
  B     cat      2    6    9   11
        dog      8    0    5   13
1 A     cat      7    3    4   11
        dog      1    2    1    2
  B     cat      0    0    3    3
        dog      1    4    2    3
2 A     cat      6    1    5   11
        dog      0    6    6    6
  B     cat      4    6    1    5
        dog      7    3    5   12
3 A     cat      9    3    8   17
        dog      1    0    4    5
  B     cat      1    9    2    3
        dog      3    5    9   12
4 A     cat      5    2    1    6
        dog      1    1    6    7
  B     cat      7    1    6   13
        dog      3    2    2    5

In [24]: rdf.unstack(['First', 'Third'])
Out[24]: 
Second  bar                 baz                 foo                 new               
First     A         B         A         B         A         B         A         B     
Third   cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog
0         3    9    2    8    0    3    6    0    2    7    9    5    5   16   11   13
1         7    1    0    1    3    2    0    4    4    1    3    2   11    2    3    3
2         6    0    4    7    1    6    6    3    5    6    1    5   11    6    5   12
3         9    1    1    3    3    0    9    5    8    4    2    9   17    5    3   12
4         5    1    7    3    2    1    1    2    1    6    6    2    6    7   13    5

And you can of course rearrange to your heart's content:

In [28]: rdf.unstack(['First', 'Third']).reorder_levels(['First', 'Second', 'Third'], axis=1).sortlevel(0, axis=1)
Out[28]: 
First     A                                       B                                   
Second  bar       baz       foo       new       bar       baz       foo       new     
Third   cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog
0         3    9    0    3    2    7    5   16    2    8    6    0    9    5   11   13
1         7    1    3    2    4    1   11    2    0    1    0    4    3    2    3    3
2         6    0    1    6    5    6   11    6    4    7    6    3    1    5    5   12
3         9    1    3    0    8    4   17    5    1    3    9    5    2    9    3   12
4         5    1    2    1    1    6    6    7    7    3    1    2    6    2   13    5
like image 133
Wes McKinney Avatar answered Oct 19 '22 06:10

Wes McKinney