After creating a DataFrame with some duplicated values in the Name column:
import pandas as pd
df = pd.DataFrame({'Name': ['Will', 'John', 'John', 'John', 'Alex'],
                   'Payment': [15, 10, 10, 10, 15],
                   'Duration': [30, 15, 15, 15, 20]})
I would like to create another DataFrame in which the duplicated values in the Name column are consolidated, leaving no duplicates. At the same time, I want to sum the payments John made. I proceed with:
df_sum = df.groupby('Name', axis=0).sum().reset_index()
But since df.groupby('Name', axis=0).sum() applies the sum function to every column in the DataFrame, the Duration column (the length of the visit in minutes) is summed as well. Instead, I would like to get the average value for the Duration column, so I would need to use the mean() method, like so:
df_mean = df.groupby('Name', axis=0).mean().reset_index()
But with mean(), the Payment column now shows the average of the payments John made, not their sum.
How can I create a DataFrame where the Duration values show the average while the Payment values show the sum?
You can apply different functions to different columns with groupby.agg:
df.groupby('Name').agg({'Duration': 'mean', 'Payment': 'sum'})
Out:
      Payment  Duration
Name
Alex       15        20
John       30        15
Will       15        30
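If a flat DataFrame with Name as a regular column is wanted (as in the question's reset_index() calls), the same per-column aggregation can be written as below. This is only a sketch: it assumes a pandas version that supports named aggregation (keyword arguments of the form output_column=(input_column, function)), and the variable name result is just illustrative. The printed formatting and column order may differ from the output above depending on the pandas version; mean typically returns floats.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Will', 'John', 'John', 'John', 'Alex'],
                   'Payment': [15, 10, 10, 10, 15],
                   'Duration': [30, 15, 15, 15, 20]})

# Sum Payment and average Duration per Name in one pass.
# Each keyword is an output column: (input column, aggregation function).
result = (
    df.groupby('Name')
      .agg(Payment=('Payment', 'sum'), Duration=('Duration', 'mean'))
      .reset_index()  # turn the Name index back into an ordinary column
)
print(result)
```

With named aggregation, the output columns appear in the order the keywords are listed, so Payment and Duration can be arranged as needed without a separate reordering step.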