I have a big dataframe, and I'm grouping by one to n columns, and want to apply a function on these groups across two columns (e.g. foo and bar).
Here's an example dataframe:
foo_function = lambda x: np.sum(x.a+x.b)
df = pd.DataFrame({'a':[1,2,3,4,5,6],
'b':[1,2,3,4,5,6],
'c':['q', 'q', 'q', 'q', 'w', 'w'],
'd':['z','z','z','o','o','o']})
# works with apply, but I want transform:
df.groupby(['c', 'd'])[['a','b']].apply(foo_function)
# transform doesn't work!
df.groupby(['c', 'd'])[['a','b']].transform(foo_function)
TypeError: cannot concatenate a non-NDFrame object
But transform
apparently isn't able to combine multiple columns together because it looks at each column separately (unlike apply). What is the next best alternative in terms of speed / elegance? e.g. I could use apply
and then create df['new_col']
by using pd.match
, but that would necessitate matching over sometimes multiple groupby columns (col1 and col2) which seems really hacky / would take a fair amount of code.
--> Is there a function that is like groupby().transform that can use functions that work over multiple columns? If this doesn't exist, what's the best hack?
Circa Pandas version 0.18, it appears the original answer (below) no longer works.
Instead, if you need to do a groupby computation across multiple columns, do the multi-column computation first, and then the groupby:
df = pd.DataFrame({'a':[1,2,3,4,5,6],
'b':[1,2,3,4,5,6],
'c':['q', 'q', 'q', 'q', 'w', 'w'],
'd':['z','z','z','o','o','o']})
df['e'] = df['a'] + df['b']
df['e'] = (df.groupby(['c', 'd'])['e'].transform('sum'))
print(df)
yields
a b c d e
0 1 1 q z 12
1 2 2 q z 12
2 3 3 q z 12
3 4 4 q o 8
4 5 5 w o 22
5 6 6 w o 22
Original answer:
The error message:
TypeError: cannot concatenate a non-NDFrame object
suggests that in order to concatenate, the foo_function
should return an NDFrame (such as a Series or DataFrame). If you return a Series, then:
In [99]: df.groupby(['c', 'd']).transform(lambda x: pd.Series(np.sum(x['a']+x['b'])))
Out[99]:
a b
0 12 12
1 12 12
2 12 12
3 8 8
4 22 22
5 22 22
The way I read the question, you want to be able to do something arbitrary with both the individual values from both columns. You just need to make sure to return a dataframe of the same size as you get passed in. I think the best way is to just make a new column, like this:
df = pd.DataFrame({'a':[1,2,3,4,5,6],
'b':[1,2,3,4,5,6],
'c':['q', 'q', 'q', 'q', 'w', 'w'],
'd':['z','z','z','o','o','o']})
df['e']=0
def f(x):
y=(x['a']+x['b'])/sum(x['b'])
return pd.DataFrame({'e':y,'a':x['a'],'b':x['b']})
df.groupby(['c','d']).transform(f)
:
a b e
0 1 1 0.333333
1 2 2 0.666667
2 3 3 1.000000
3 4 4 2.000000
4 5 5 0.909091
5 6 6 1.090909
If you have a very complicated dataframe, you can pick your columns (e.g. df.groupby(['c'])['a','b','e'].transform(f)
)
This sure looks very inelegant to me, but it's still much faster than apply
on large datasets.
Another alternative is to use set_index
to capture all the columns you need and then pass just one column to transform
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With