Let's say I have a table with a key (e.g. customer ID) and two numeric columns C1 and C2. I would like to group rows by the key (customer) and run some aggregators like sum and mean on its columns. After computing group aggregators I would like to assign the results back to each customer row in a DataFrame (as some customer-wide features added to each row).
I can see that I can do something like

df['F1'] = df.groupby(['Key'])['C1'].transform(np.sum)
if I want to aggregate just one column and be able to add the result back to the DataFrame.
Can I make it conditional - can I add up C1 column in a group only for rows whose C2 column is equal to some number X and still be able to add results back to the DataFrame?
How can I run an aggregator on a combination of columns, like:
np.sum(C1 + C2)?
What would be the simplest and most elegant way to implement it? What is the most efficient way? Can those aggregations be done in one pass?
Thank you in advance.
Here's some setup with dummy data.
In [81]: df = pd.DataFrame({'Key': ['a','a','b','b','c','c'],
    ...:                    'C1': [1,2,3,4,5,6],
    ...:                    'C2': [7,8,9,10,11,12]})
In [82]: df['F1'] = df.groupby('Key')['C1'].transform(np.sum)
In [83]: df
Out[83]:
C1 C2 Key F1
0 1 7 a 3
1 2 8 a 3
2 3 9 b 7
3 4 10 b 7
4 5 11 c 11
5 6 12 c 11
If you want to do a conditional GroupBy, you can just filter the DataFrame as it's passed to .groupby. For example, if you wanted the group sum of 'C1' for rows where 'C2' is less than 8 or greater than 9:
In [87]: cond = (df['C2'] < 8) | (df['C2'] > 9)
In [88]: df['F2'] = df[cond].groupby('Key')['C1'].transform(np.sum)
In [89]: df
Out[89]:
C1 C2 Key F1 F2
0 1 7 a 3 1
1 2 8 a 3 NaN
2 3 9 b 7 NaN
3 4 10 b 7 4
4 5 11 c 11 11
5 6 12 c 11 11
This works because the transform operation preserves the index, so it will still align with the original DataFrame correctly.
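As a variant (my sketch, not part of the original answer): instead of filtering rows out of the groupby, you can mask the values with Series.where. Since sum() skips NaN, this broadcasts the conditional group sum to every row rather than leaving NaN in the excluded rows — which may be closer to the "customer-wide feature on each row" goal. Assuming the same df and cond as above:

```python
import pandas as pd

df = pd.DataFrame({'Key': ['a','a','b','b','c','c'],
                   'C1': [1,2,3,4,5,6],
                   'C2': [7,8,9,10,11,12]})
cond = (df['C2'] < 8) | (df['C2'] > 9)

# Replace C1 with NaN where the condition fails; sum() skips NaN,
# and transform broadcasts the group result back to every row.
df['F2b'] = df['C1'].where(cond).groupby(df['Key']).transform('sum')
# F2b is [1, 1, 4, 4, 11, 11] -- no NaN, unlike the filtered version
```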
If you want to sum the group totals of two columns, the easiest approach is probably something like this (someone may have something more clever):
In [93]: gb = df.groupby('Key')
In [94]: df['C1+C2'] = gb['C1'].transform(np.sum) + gb['C2'].transform(np.sum)
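Alternatively (a sketch of my own, not from the original answer), you can build the row-wise sum first and group that, which matches the np.sum(C1 + C2) formulation directly and uses a single transform:

```python
import pandas as pd

df = pd.DataFrame({'Key': ['a','a','b','b','c','c'],
                   'C1': [1,2,3,4,5,6],
                   'C2': [7,8,9,10,11,12]})

# Sum the two columns row-wise, then take the group total of that sum.
# Grouping a Series by another Series aligns them on the index.
df['C1+C2'] = (df['C1'] + df['C2']).groupby(df['Key']).transform('sum')
# Groups: a -> (1+7)+(2+8) = 18, b -> 26, c -> 34
```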
Edit: Here's one other way to get group totals for multiple columns. The syntax isn't really any cleaner, but it may be more convenient for a large number of columns.
df['C1_C2'] = gb[['C1','C2']].apply(lambda x: pd.DataFrame(x.sum().sum(), index=x.index, columns=['']))
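On the "one pass" part of the question: one option (a sketch using the named-aggregation syntax available since pandas 0.25, not part of the original answer) is to compute all the group features in a single groupby/agg and then join them back onto the rows by key:

```python
import pandas as pd

df = pd.DataFrame({'Key': ['a','a','b','b','c','c'],
                   'C1': [1,2,3,4,5,6],
                   'C2': [7,8,9,10,11,12]})

# Compute all group-level features in one groupby...
feats = df.groupby('Key').agg(C1_sum=('C1', 'sum'),
                              C2_mean=('C2', 'mean'))

# ...then broadcast them back to the rows by joining on the key.
df = df.join(feats, on='Key')
```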