Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to use groupby transform across multiple columns

Tags:

python

pandas

I have a big dataframe, and I'm grouping by one to n columns, and want to apply a function on these groups across two columns (e.g. foo and bar).

Here's an example dataframe:

foo_function = lambda x: np.sum(x.a+x.b)

df = pd.DataFrame({'a':[1,2,3,4,5,6],
                   'b':[1,2,3,4,5,6],
                   'c':['q', 'q', 'q', 'q', 'w', 'w'],  
                   'd':['z','z','z','o','o','o']})

# works with apply, but I want transform:
df.groupby(['c', 'd'])[['a','b']].apply(foo_function)
# transform doesn't work!
df.groupby(['c', 'd'])[['a','b']].transform(foo_function)
TypeError: cannot concatenate a non-NDFrame object

But transform apparently isn't able to combine multiple columns together because it looks at each column separately (unlike apply). What is the next best alternative in terms of speed / elegance? e.g. I could use apply and then create df['new_col'] by using pd.match, but that would necessitate matching over sometimes multiple groupby columns (col1 and col2) which seems really hacky / would take a fair amount of code.

--> Is there a function that is like groupby().transform that can use functions that work over multiple columns? If this doesn't exist, what's the best hack?

like image 441
Hillary Sanders Avatar asked Dec 05 '15 00:12

Hillary Sanders


2 Answers

Circa Pandas version 0.18, it appears the original answer (below) no longer works.

Instead, if you need to do a groupby computation across multiple columns, do the multi-column computation first, and then the groupby:

df = pd.DataFrame({'a':[1,2,3,4,5,6],
                   'b':[1,2,3,4,5,6],
                   'c':['q', 'q', 'q', 'q', 'w', 'w'],  
                   'd':['z','z','z','o','o','o']})
df['e'] = df['a'] + df['b']
df['e'] = (df.groupby(['c', 'd'])['e'].transform('sum'))
print(df)

yields

   a  b  c  d   e
0  1  1  q  z  12
1  2  2  q  z  12
2  3  3  q  z  12
3  4  4  q  o   8
4  5  5  w  o  22
5  6  6  w  o  22

Original answer:

The error message:

TypeError: cannot concatenate a non-NDFrame object

suggests that in order to concatenate, the foo_function should return an NDFrame (such as a Series or DataFrame). If you return a Series, then:

In [99]: df.groupby(['c', 'd']).transform(lambda x: pd.Series(np.sum(x['a']+x['b'])))
Out[99]: 
    a   b
0  12  12
1  12  12
2  12  12
3   8   8
4  22  22
5  22  22
like image 141
unutbu Avatar answered Nov 17 '22 12:11

unutbu


The way I read the question, you want to be able to do something arbitrary with both the individual values from both columns. You just need to make sure to return a dataframe of the same size as you get passed in. I think the best way is to just make a new column, like this:

df = pd.DataFrame({'a':[1,2,3,4,5,6],
                   'b':[1,2,3,4,5,6],
                   'c':['q', 'q', 'q', 'q', 'w', 'w'],  
                   'd':['z','z','z','o','o','o']})
df['e']=0

def f(x):
    y=(x['a']+x['b'])/sum(x['b'])
    return pd.DataFrame({'e':y,'a':x['a'],'b':x['b']})

df.groupby(['c','d']).transform(f)

:

    a   b   e
0   1   1   0.333333
1   2   2   0.666667
2   3   3   1.000000
3   4   4   2.000000
4   5   5   0.909091
5   6   6   1.090909

If you have a very complicated dataframe, you can pick your columns (e.g. df.groupby(['c'])['a','b','e'].transform(f))

This sure looks very inelegant to me, but it's still much faster than apply on large datasets.

Another alternative is to use set_index to capture all the columns you need and then pass just one column to transform.

like image 28
Victor Chubukov Avatar answered Nov 17 '22 12:11

Victor Chubukov