Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python Pandas Conditional Sum with Groupby

Using sample data:

df = pd.DataFrame({'key1' : ['a','a','b','b','a'],                'key2' : ['one', 'two', 'one', 'two', 'one'],                'data1' : np.random.randn(5),                'data2' : np. random.randn(5)}) 

df

    data1        data2     key1  key2 0    0.361601    0.375297    a   one 1    0.069889    0.809772    a   two 2    1.468194    0.272929    b   one 3   -1.138458    0.865060    b   two 4   -0.268210    1.250340    a   one 

I'm trying to figure out how to group the data by key1 and sum only the data1 values where key2 equals 'one'.

Here's what I've tried

def f(d,a,b):     d.ix[d[a] == b, 'data1'].sum()  df.groupby(['key1']).apply(f, a = 'key2', b = 'one').reset_index() 

But this gives me a dataframe with 'None' values

index   key1    0 0       a       None 1       b       None 

Any ideas here? I'm looking for the Pandas equivalent of the following SQL:

SELECT Key1, SUM(CASE WHEN Key2 = 'one' then data1 else 0 end) FROM df GROUP BY key1 

FYI - I've seen conditional sums for pandas aggregate but couldn't transform the answer provided there to work with sums rather than counts.

Thanks in advance

like image 520
AllenQ Avatar asked Jun 23 '13 23:06

AllenQ


People also ask

How do you sum a column by group in Python?

Use DataFrame. groupby(). sum() to group rows based on one or multiple columns and calculate sum agg function. groupby() function returns a DataFrameGroupBy object which contains an aggregate function sum() to calculate a sum of a given column for each group.

How do you sum values in a DataFrame based on a condition?

How to Sum rows by condition Pandas Dataframe using loc[] The Python Pandas sum() function returns the sum of the given values over the requested axis. If the parameter is axis=0 for summing the rows of Pandas dataframe it is the default value.

What is possible using Groupby () method of pandas?

groupby() function is used to split the data into groups based on some criteria. pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. sort : Sort group keys.


2 Answers

First groupby the key1 column:

In [11]: g = df.groupby('key1') 

and then for each group take the subDataFrame where key2 equals 'one' and sum the data1 column:

In [12]: g.apply(lambda x: x[x['key2'] == 'one']['data1'].sum()) Out[12]: key1 a       0.093391 b       1.468194 dtype: float64 

To explain what's going on let's look at the 'a' group:

In [21]: a = g.get_group('a')  In [22]: a Out[22]:       data1     data2 key1 key2 0  0.361601  0.375297    a  one 1  0.069889  0.809772    a  two 4 -0.268210  1.250340    a  one  In [23]: a[a['key2'] == 'one'] Out[23]:       data1     data2 key1 key2 0  0.361601  0.375297    a  one 4 -0.268210  1.250340    a  one  In [24]: a[a['key2'] == 'one']['data1'] Out[24]: 0    0.361601 4   -0.268210 Name: data1, dtype: float64  In [25]: a[a['key2'] == 'one']['data1'].sum() Out[25]: 0.093391000000000002 

It may be slightly easier/clearer to do this by restricting the dataframe to just those with key2 equals one first:

In [31]: df1 = df[df['key2'] == 'one']  In [32]: df1 Out[32]:       data1     data2 key1 key2 0  0.361601  0.375297    a  one 2  1.468194  0.272929    b  one 4 -0.268210  1.250340    a  one  In [33]: df1.groupby('key1')['data1'].sum() Out[33]: key1 a       0.093391 b       1.468194 Name: data1, dtype: float64 
like image 132
Andy Hayden Avatar answered Oct 12 '22 08:10

Andy Hayden


I think that today with pandas 0.23 you can do this:

import numpy as np   df.assign(result = np.where(df['key2']=='one',df.data1,0))\    .groupby('key1').agg({'result':sum}) 

The advantage of this is that you can apply it to more than one column of the same dataframe

df.assign(  result1 = np.where(df['key2']=='one',df.data1,0),  result2 = np.where(df['key2']=='two',df.data1,0)   ).groupby('key1').agg({'result1':sum, 'result2':sum}) 
like image 28
Diego Avatar answered Oct 12 '22 07:10

Diego