Groupby with User Defined Functions Pandas

Tags:

python

pandas

I understand that passing a function as a group key calls the function once per index value with the return values being used as the group names. What I can't figure out is how to call the function on column values.

So I can do this:

import numpy as np
import pandas as pd

people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

# Called once per index label; the return value becomes the group name
def GroupFunc(x):
    if len(x) > 3:
        return 'Group1'
    else:
        return 'Group2'

people.groupby(GroupFunc).sum()

This splits the data into two groups: one whose index values have length three or less, and one whose index values have length greater than three. But how can I pass one of the column values instead? For example, I might want to group on whether the value in column a is greater than 1 for each index label. I realise I could just do the following:

people.groupby(people.a > 1).sum() 

But I want to know how to do this in a user defined function for future reference.

Something like:

def GroupColFunc(x):
    if x > 1:
        return 'Group1'
    else:
        return 'Group2'

But how do I call this? I tried

people.groupby(GroupColFunc(people.a)) 

and similar variants but this does not work.

How do I pass the column values to the function? How would I pass multiple column values e.g. to group on whether people.a > people.b for example?

Woody Pride asked Oct 27 '13 07:10



1 Answer

To group by a > 1, you can define your function like:

>>> def GroupColFunc(df, ind, col):
...     if df[col].loc[ind] > 1:
...         return 'Group1'
...     else:
...         return 'Group2'
...

And then call it like this:

>>> people.groupby(lambda x: GroupColFunc(people, x, 'a')).sum()
               a         b         c         d        e
Group2 -2.384614 -0.762208  3.359299 -1.574938 -2.65963
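If the function only needs the value from one column, another option is to apply it to that column first and group by the resulting Series. This is only a sketch: the fixed numbers and the names group_col_func and keys are made up for illustration, not taken from the question.

```python
import pandas as pd

# Small fixed frame (illustrative values, not the random data from the question)
people = pd.DataFrame(
    {"a": [1.5, -0.2, 0.3, 2.0, -1.0],
     "b": [0.1, 0.4, -0.5, 1.0, 0.7]},
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)

# Hypothetical helper: receives one scalar column value at a time
def group_col_func(value):
    return "Group1" if value > 1 else "Group2"

# Series.map calls the function once per value in column 'a';
# the result is a Series of group names indexed like `people`
keys = people["a"].map(group_col_func).rename("group")

result = people.groupby(keys).sum()
print(result)
```

Here the function never sees the index label, only the column value, which is closer to what the question's GroupColFunc was trying to do.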

Or you can do it with just an anonymous function:

>>> people.groupby(lambda x: 'Group1' if people['b'].loc[x] > people['a'].loc[x] else 'Group2').sum()
               a         b         c         d         e
Group1 -3.280319 -0.007196  1.525356  0.324154 -1.002439
Group2  0.895705 -0.755012  1.833943 -1.899092 -1.657191
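For a rule that looks at several columns, one possible variant is to write a function that takes a whole row and build the keys with DataFrame.apply(axis=1). Again just a sketch with made-up data; group_row_func is a hypothetical name.

```python
import pandas as pd

# Small fixed frame (illustrative values, not the random data from the question)
people = pd.DataFrame(
    {"a": [1.5, -0.2, 0.3, 2.0, -1.0],
     "b": [0.1, 0.4, -0.5, 1.0, 0.7]},
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)

# Hypothetical helper: receives one row (a Series keyed by column name)
def group_row_func(row):
    return "Group1" if row["b"] > row["a"] else "Group2"

# axis=1 passes each row to the function; the result is a Series
# of group names sharing the index of `people`
keys = people.apply(group_row_func, axis=1)

result = people.groupby(keys).sum()
print(result)
```

Row-wise apply is convenient but slow on large frames, so a vectorized comparison such as people['b'] > people['a'] is usually preferable when the rule allows it.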

As the documentation says, you can also group by passing a Series that provides a label -> group name mapping:

>>> mapping = np.where(people['b'] > people['a'], 'Group1', 'Group2')
>>> mapping
Joe       Group2
Steve     Group1
Wes       Group2
Jim       Group1
Travis    Group1
dtype: string48
>>> people.groupby(mapping).sum()
               a         b         c         d         e
Group1 -3.280319 -0.007196  1.525356  0.324154 -1.002439
Group2  0.895705 -0.755012  1.833943 -1.899092 -1.657191
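As far as I know, on recent pandas versions np.where returns a plain NumPy array rather than a Series, so the printed mapping would not carry the index labels shown in the session above. If an explicit label -> group name Series is wanted, a minimal sketch (with made-up data) is to wrap the array yourself:

```python
import numpy as np
import pandas as pd

# Small fixed frame (illustrative values, not the random data from the question)
people = pd.DataFrame(
    {"a": [1.5, -0.2, 0.3, 2.0, -1.0],
     "b": [0.1, 0.4, -0.5, 1.0, 0.7]},
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)

# np.where yields an ndarray of group names, one per row
labels = np.where(people["b"] > people["a"], "Group1", "Group2")

# Attach the original index so each label maps to a group name
mapping = pd.Series(labels, index=people.index)

result = people.groupby(mapping).sum()
print(mapping)
print(result)
```

Passing the bare array to groupby should also work as long as its length matches the frame; the Series form just makes the label -> group correspondence explicit.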
Roman Pekar answered Oct 06 '22 11:10
