Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to use groupby in pandas to calculate a percentage / proportion total based on a criteria in another column

I'm trying to work out how to use the groupby function in pandas to work out the proportions of values per year with a given Yes/No criteria.

For example, I have a dataframe called names:

  Name  Number  Year   Sex Criteria
0  name1     789  1998  Male      N
1  name1     688  1999  Male      N
2  name1     639  2000  Male      N
3  name2     551  1998  Male      Y
4  name2     499  1999  Male      Y

I can use

namesgrouped = names.groupby(["Sex", "Year", "Criteria"]).sum()

to get:

                   Number
Sex    Year      Criteria
Male   1998 N        14507
            Y         2308
       1999 N        14119
            Y         2331

and so on. I would like the 'Number Criteria' column to show the % of the total for each gender and year - so instead of N = 14507 and Y = 2308 for 1998 above I'd have N = 86.27% and Y = 13.73%.

Can anyone advise how to do this?

like image 567
fuzzy_logic_77 Avatar asked May 02 '16 17:05

fuzzy_logic_77


People also ask

How do you calculate percentage in Groupby pandas?

You can calculate the percentage of total with the groupby of pandas DataFrame by using DataFrame. groupby() , DataFrame. agg() , DataFrame. transform() methods and DataFrame.

Can you use Groupby with multiple columns in pandas?

groupby() can take the list of columns to group by multiple columns and use the aggregate functions to apply single or multiple aggregations at the same time.

How do you calculate cumulative percentage in pandas?

First, create a data frame as 'data_frame' and provide the values you need to calculate the cumulative sum, then pass the 'data_frame' parameter to pd. DataFrame() while specifying the column values, and finally, use the cumsum() and sum() built-in functions to calculate the cumulative percentage.

What is possible using Groupby () method of pandas?

groupby() can accept several different arguments: A column or list of columns. A dict or pandas Series. A NumPy array or pandas Index , or an array-like iterable of these.


1 Answers

This question is a direct extension of the suggested duplicate. Borrowing from the accepted answer, this will work:

In [46]: namesgrouped.groupby(level=[0, 1]).apply(lambda g: g / g.sum())
Out[46]: 
                      Number
Sex  Year Criteria          
Male 1998 N         0.588806
          Y         0.411194
     1999 N         0.579612
          Y         0.420388
     2000 N         1.000000

Edit: a transform operation might be faster than apply:

namesgrouped / namesgrouped.groupby(level=[0, 1]).transform('sum')
like image 93
IanS Avatar answered Nov 15 '22 15:11

IanS