
Apply function to 2nd column in pandas dataframe groupby

In a pandas dataframe, a function can be used to group its index. I'm looking to define a function that instead is applied to a column.

I'm looking to group by two columns, except I need the second column to be grouped by an arbitrary function, foo:

group_sum = df.groupby(['name', foo])['tickets'].sum()

How would foo be defined to group the second column into two groups, demarcated by whether values are > 0, for example? Or, is an entirely different approach or syntax used?

Brian Bien asked Oct 25 '16 23:10




2 Answers

groupby accepts any combination of column labels and Series/arrays (as long as each array has the same length as your dataframe), so you can map the function over your column and pass the result to groupby. For example, with df[1] standing in for your second column:

df.groupby(['name', df[1].map(foo)])
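As a minimal sketch of how foo could be defined for the "> 0" case in the question: the sample data and the column name 'value' below are assumptions made for illustration, not part of the original post.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "name": ["a", "a", "a", "b"],
    "value": [-1, 1, 2, 0],
    "tickets": [20, 50, 50, 100],
})


def foo(v):
    """Return True for values above zero, False otherwise."""
    return v > 0


# Map foo over the second column and use the resulting Series as a group key.
group_sum = df.groupby(["name", df["value"].map(foo)])["tickets"].sum()
print(group_sum)
```

Because foo is only called through map, any function that takes a single value and returns a hashable group label should work the same way.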

Alternatively, you might want to add the condition as a new column to your dataframe before you perform the groupby. This has the advantage of giving it a name in the index:

df['>0'] = df[1] > 0
group_sum = df.groupby(['name', '>0'])['tickets'].sum()
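If you would rather not modify the dataframe, a possible variant is to name the boolean Series itself with rename, since groupby uses the key's name for the index level. This is a sketch under assumed data; the column name 'value' and the label 'positive' are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "name": ["a", "a", "a", "b"],
    "value": [-1, 1, 2, 0],
    "tickets": [20, 50, 50, 100],
})

# Build the boolean key and give it a name; df itself is left unchanged.
is_positive = (df["value"] > 0).rename("positive")

# The Series name becomes the name of the second index level in the result.
group_sum = df.groupby(["name", is_positive])["tickets"].sum()
print(group_sum)
```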

maxymoo answered Oct 17 '22 01:10


Something like this will work:

x.groupby(['name', x['value']>0])['tickets'].sum()

As mentioned above, groupby accepts both labels and Series, so this should give you the answer you are looking for. Here is an example:

import numpy as np
import pandas as pd

data = np.array([[1, -1, 20], [1, 1, 50], [1, 1, 50], [2, 0, 100]])
x = pd.DataFrame(data, columns=['name', 'value', 'value2'])
x.groupby(['name', x['value'] > 0])['value2'].sum()

name  value
1     False     20
      True     100
2     False    100
Name: value2, dtype: int64

RDizzl3 answered Oct 17 '22 01:10