 

Applying a custom groupby aggregate function to output a binary outcome in pandas python

I have a dataset of trader transactions where the variable of interest is Buy/Sell, which is binary and takes the value 1 if the transaction was a buy and 0 if it was a sell. An example looks as follows:

Trader   Buy/Sell
A        1
A        0
B        1
B        1
B        0
C        1
C        0
C        0

I would like to calculate the net Buy/Sell for each trader: if more than 50% of a trader's trades were buys, his Buy/Sell is 1; if fewer than 50% were buys, it is 0; and if exactly 50% were buys, it is NA (and would be disregarded in future calculations).

So for trader A, the buy proportion is (number of buys)/(total number of trades) = 1/2 = 0.5, which gives NA.

For trader B it is 2/3 = 0.67, which gives a 1.

For trader C it is 1/3 = 0.33, which gives a 0.

The table should look like this:

Trader   Buy/Sell
A        NA
B        1
C        0

Ultimately I want to compute the total aggregated number of buys, which in this case is 1, and the aggregated total number of trades (disregarding NAs), which in this case is 2. I am not interested in the second table itself, just in the aggregated number of buys and the aggregated total number (count) of Buy/Sell.

How can I do this in Pandas?

asked Nov 08 '14 by finstats



1 Answer

import numpy as np
import pandas as pd

df = pd.DataFrame({'Buy/Sell': [1, 0, 1, 1, 0, 1, 0, 0],
                   'Trader': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})

grouped = df.groupby(['Trader'])
result = grouped['Buy/Sell'].agg(['sum', 'count'])
means = grouped['Buy/Sell'].mean()
result['Buy/Sell'] = np.select(condlist=[means > 0.5, means < 0.5],
                               choicelist=[1, 0], default=np.nan)
print(result)

yields

        Buy/Sell  sum  count
Trader
A            NaN    1      2
B              1    2      3
C              0    1      3
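From there, the aggregated numbers the question ultimately asks for fall out directly, since both `sum` and `count` skip NaN on a pandas Series. A minimal sketch, rebuilding the same `means` as above (the variable names here are illustrative, not part of the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Buy/Sell': [1, 0, 1, 1, 0, 1, 0, 0],
                   'Trader': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})

means = df.groupby('Trader')['Buy/Sell'].mean()
# Net Buy/Sell per trader: 1, 0, or NaN at exactly 50%
net = pd.Series(np.select([means > 0.5, means < 0.5], [1, 0], default=np.nan),
                index=means.index)

total_buys = net.sum()      # NaN entries are skipped -> 1.0
total_trades = net.count()  # count ignores NaN -> 2
print(total_buys, total_trades)
```

Here trader A's 50/50 split becomes NaN and is excluded from both aggregates, giving 1 buy out of 2 counted traders.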

My original answer used a custom aggregator, categorize:

def categorize(x):
    m = x.mean()
    return 1 if m > 0.5 else 0 if m < 0.5 else np.nan

result = df.groupby(['Trader'])['Buy/Sell'].agg([categorize, 'sum', 'count'])
result = result.rename(columns={'categorize': 'Buy/Sell'})
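As a quick sanity check on the question's sample data, the two approaches agree: the custom aggregator and the `np.select` construction produce the same net Buy/Sell series (this comparison snippet is my own sketch, not part of the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Buy/Sell': [1, 0, 1, 1, 0, 1, 0, 0],
                   'Trader': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})

def categorize(x):
    m = x.mean()
    return 1 if m > 0.5 else 0 if m < 0.5 else np.nan

# Approach 1: custom aggregator
via_custom = df.groupby('Trader')['Buy/Sell'].agg(categorize)

# Approach 2: vectorized np.select over the group means
means = df.groupby('Trader')['Buy/Sell'].mean()
via_select = pd.Series(np.select([means > 0.5, means < 0.5], [1, 0],
                                 default=np.nan),
                       index=means.index, name='Buy/Sell')

# Series.equals treats NaNs in the same position as equal
print(via_custom.equals(via_select))
```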

While calling a custom function may be convenient, performance is often significantly worse with a custom function than with the built-in aggregators (such as groupby/agg/mean). The built-in aggregators are Cythonized, while custom functions reduce performance to plain Python for-loop speeds.

The difference in speed is particularly significant when the number of groups is large. For example, with a 10000-row DataFrame with 1000 groups,

import numpy as np
import pandas as pd

np.random.seed(2017)
N = 10000
df = pd.DataFrame({
    'Buy/Sell': np.random.randint(2, size=N),
    'Trader': np.random.randint(1000, size=N)})

def using_select(df):
    grouped = df.groupby(['Trader'])
    result = grouped['Buy/Sell'].agg(['sum', 'count'])
    means = grouped['Buy/Sell'].mean()
    result['Buy/Sell'] = np.select(condlist=[means > 0.5, means < 0.5],
                                   choicelist=[1, 0], default=np.nan)
    return result

def categorize(x):
    m = x.mean()
    return 1 if m > 0.5 else 0 if m < 0.5 else np.nan

def using_custom_function(df):
    result = df.groupby(['Trader'])['Buy/Sell'].agg([categorize, 'sum', 'count'])
    result = result.rename(columns={'categorize': 'Buy/Sell'})
    return result

using_select is over 50x faster than using_custom_function:

In [69]: %timeit using_custom_function(df)
10 loops, best of 3: 132 ms per loop

In [70]: %timeit using_select(df)
100 loops, best of 3: 2.46 ms per loop

In [71]: 132/2.46
Out[71]: 53.65853658536585
answered Sep 22 '22 by unutbu