In a pandas dataframe, a function can be used to group its index. I'm looking to define a function that instead is applied to a column.
I'm looking to group by two columns, except I need the second column to be grouped by an arbitrary function, foo
:
group_sum = df.groupby(['name', foo])['tickets'].sum()
How would foo
be defined to group the second column into two groups, demarcated by whether values are > 0
, for example? Or, is an entirely different approach or syntax used?
Pandas apply() Function to Single & Multiple Column(s) Using pandas. DataFrame. apply() method you can execute a function to a single column, all and list of multiple columns (two or more).
Pandas comes with a whole host of sql-like aggregation functions you can apply when grouping on one or more columns. This is Python's closest equivalent to dplyr's group_by + summarise logic.
Groupby can accept any combination of both labels and series/arrays (as long as the array has the same length as your dataframe), so you can map the function to your column and pass it into the groupby, like
df.groupby(['name', df[1].map(foo)])
Alternatively you might want to add the condition as a new column to your dataframe before your perform the groupby, this will have the advantage of giving it a name in the index:
df['>0'] = df[1] > 0
group_sum = df.groupby(['name', '>0'])['tickets'].sum()
Something like this will work:
x.groupby(['name', x['value']>0])['tickets'].sum()
Like mentioned above the groupby
can accept labels and series. This should give you the answer you are looking for. Here is an example:
data = np.array([[1, -1, 20], [1, 1, 50], [1, 1, 50], [2, 0, 100]])
x = pd.DataFrame(data, columns = ['name', 'value', 'value2'])
x.groupby(['name', x['value']>0])['value2'].sum()
name value
1 False 20
True 100
2 False 100
Name: value2, dtype: int64
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With