Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: conditional group-specific computations

Tags:

python

pandas

Let's say I have a table with a key (e.g. customer ID) and two numeric columns C1 and C2. I would like to group rows by the key (customer) and run some aggregators like sum and mean on its columns. After computing group aggregators I would like to assign the results back to each customer row in a DataFrame (as some customer-wide features added to each row).

I can see that I can do something like
df['F1'] = df.groupby(['Key'])['C1'].transform(np.sum)
if I want to aggregate just one column and be able to add the result back to the DataFrame.

Can I make it conditional - can I add up C1 column in a group only for rows whose C2 column is equal to some number X and still be able to add results back to the DataFrame?

How can I run aggregator on a combination of rows like:
np.sum(C1 + C2)?

What would be the simplest and most elegant way to implement it? What is the most efficient way to do it? Can those aggregations be done in a one path?

Thank you in advance.

like image 840
kpax Avatar asked Jun 16 '14 19:06

kpax


Video Answer


1 Answers

Here's some setup of some dummy data.

In [81]: df = pd.DataFrame({'Key': ['a','a','b','b','c','c'], 
                            'C1': [1,2,3,4,5,6],  
                            'C2': [7,8,9,10,11,12]})
In [82]: df['F1'] = df.groupby('Key')['C1'].transform(np.sum)

In [83]: df
Out[83]: 
   C1  C2 Key  F1
0   1   7   a   3
1   2   8   a   3
2   3   9   b   7
3   4  10   b   7
4   5  11   c  11
5   6  12   c  11

If you want to do a conditional GroupBy, you can just filter the dataframe as it's passed to .groubpy. For example, if you wanted the group sum of 'C1' if C2 is less than 8 or greater than 9.

In [87]: cond = (df['C2'] < 8) | (df['C2'] > 9)

In [88]: df['F2'] = df[cond].groupby('Key')['C1'].transform(np.sum)

In [89]: df
Out[89]: 
   C1  C2 Key  F1  F2
0   1   7   a   3   1
1   2   8   a   3 NaN
2   3   9   b   7 NaN
3   4  10   b   7   4
4   5  11   c  11  11
5   6  12   c  11  11

This works because the transform operation preserves the index, so it will still align with the original dataframe correctly.

If you want to sum the group totals for two columns, probably easiest to do something like this? Someone may have something more clever.

In [93]: gb = df.groupby('Key')

In [94]: df['C1+C2'] = gb['C1'].transform(np.sum) + gb['C2'].transform(np.sum)

Edit: Here's one other way to get group totals for multiple columns. The syntax isn't really any cleaner, but may be more convenient for a large number of a columns.

df['C1_C2'] = gb[['C1','C2']].apply(lambda x: pd.DataFrame(x.sum().sum(), index=x.index, columns=['']))
like image 74
chrisb Avatar answered Oct 08 '22 01:10

chrisb