Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Group by, count and calculate proportions in pandas?

Tags:

python

pandas

I have a dataframe as follows:

d = {
  'id': [1, 2, 3, 4, 5], 
  'is_overdue': [True, False, True, True, False],
  'org': ['A81001', 'A81002', 'A81001', 'A81002', 'A81003']
}
df = pd.DataFrame(data=d)

Now I want to work out for each organisation, what percentage of rows are overdue, and what percentage are not.

I know how to group by organisation and overdue status:

df.groupby(['org', 'is_overdue']).agg('count')

But how do I get the proportion by organisation? I want to end up with something like this:

org      is_overdue   not_overdue   proportion_overdue
A81001   2            0             100
A81002   1            1             50
A81003   0            1             0
like image 553
Richard Avatar asked Jul 22 '16 15:07

Richard


3 Answers

You could use DataFrame.apply.

First group by the organizations and count the number of overdue/non-overdue. Then calculate the percentage.

df_overdue = df.groupby(['org']).apply(lambda dft: pd.Series({'is_overdue': dft.is_overdue.sum(), 'not_overdue': (~dft.is_overdue).sum()}))
df_overdue['proportion_overdue'] = df_overdue['is_overdue'] / (df_overdue['not_overdue'] + df_overdue['is_overdue'])

print(df_overdue)

outputs

        is_overdue  not_overdue  proportion_overdue
org                                                
A81001           2            0                 1.0
A81002           1            1                 0.5
A81003           0            1                 0.0
like image 136
Alex Avatar answered Oct 15 '22 09:10

Alex


You could use pd.crosstab to create a frequency table -- i.e. to count the number of is_overdues for each org.

import pandas as pd
d = {
  'id': [1, 2, 3, 4, 5], 
  'is_overdue': [True, False, True, True, False],
  'org': ['A81001', 'A81002', 'A81001', 'A81002', 'A81003']
}
df = pd.DataFrame(data=d)

result = pd.crosstab(index=df['org'], columns=df['is_overdue'], margins=True)
result = result.rename(columns={True:'is_overdue', False:'not overdue'})
result['proportion'] = result['is_overdue']/result['All']*100
print(result)

yields

is_overdue  not overdue  is_overdue  All  proportion
org                                                 
A81001                0           2    2       100.0
A81002                1           1    2        50.0
A81003                1           0    1         0.0
All                   2           3    5        60.0
like image 5
unutbu Avatar answered Oct 15 '22 07:10

unutbu


There are more efficient ways to do this but since you were trying to use aggregate functions initially, this is the way to solve your problem using aggregate functions:

df.is_overdue = df.is_overdue.map({True: 1, False: 0})
df.groupby(['org'])['is_overdue'].agg({'total_count':'count', 'is_overdue':   'sum'}).reset_index()

Now you can just calculate not_overdue and proportion_overdue very easily.

like image 2
A.Kot Avatar answered Oct 15 '22 08:10

A.Kot