Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to divide the sum of a groupby value with the count the another value

I wanted to groupby by 'label' and 'month' to sum the Quantity sold for each month and for each label.

Dataset

I am trying to do 'groupby and apply' method for achieving this, but not sure how to count the month for each label. Say, for label value AFFLELOU (DOS), I have two values for 7th month. so, I should sum the quantity sold and divide by 2. for 9th and 10th month, I just have one value, so the count would be 1 and it would divide the quantity sold.

I wrote the code below, but it doesn't take count as a function and return count not defined error.

t1.groupby(['label', 'month']).apply(lambda x: x['Quantity sold'] 
.sum()/count('month'))

Can someone tell me how to get the count value of each month for each label?

Thanks in Advance.

like image 422
vishnu prashanth Avatar asked Mar 27 '18 13:03

vishnu prashanth


People also ask

How do you use Groupby and count?

Use count() by Column Namegroupby() to group the rows by column and use count() method to get the count for each group by ignoring None and Nan values. It works with non-floating type data as well. The below example does the grouping on Courses column and calculates count how many times each value is present.

How do you count rows in Groupby?

The most simple method for pandas groupby count is by using the in-built pandas method named size(). It returns a pandas series that possess the total number of row count for each group. The basic working of the size() method is the same as len() method and hence, it is not affected by NaN values in the dataset.


1 Answers

Instead of summing, counting and dividing, you could use agg('mean'):

t1.groupby(['label', 'month'])['Quantity sold'].agg('mean') 

Or, if you do wish to retain the sum and count, use:

t1.groupby(['label', 'month'])['Quantity sold'].agg(['sum', 'count', 'mean']) 

For example,

import numpy as np
import pandas as pd

t1 = pd.DataFrame(np.random.randint(4, size=(20,3)), columns=['label', 'Quantity sold', 'month'])
t1.groupby(['label', 'month'])['Quantity sold'].agg(['sum', 'count', 'mean']) 

yields a DataFrame like

             sum  count  mean
label month                  
0     1        2      1  2.00
      2        0      1  0.00
      3        2      2  1.00
1     1        1      2  0.50
      2        3      1  3.00
      3        1      1  1.00
2     0        0      1  0.00
      1        0      3  0.00
      3        5      4  1.25
3     0        1      1  1.00
      1        0      1  0.00
      2        0      1  0.00
      3        3      1  3.00

Using groupby/agg with its builtin aggregators sum, count and mean is clearly more convenient here, but if you did need to use groupby/apply with a custom function you could use:

t1.groupby(['label', 'month']).apply(lambda x: x['Quantity sold'].sum()/len(x))

Note that while calling custom functions with groupby/apply gives you more flexibility, it comes at a cost because calling a custom Python function once for each group is generally slower than calling the builtin Cythonized aggregators available in groupby/agg.


If you have missing (NaN) values in Quantity sold, it may help to know that group/agg has both 'count' and 'size' aggregators:

  • 'count' returns the number of non-NaN values
  • 'size' returns the length of the group (including NaN values)

The count is always less than or equal to the size. The mean is the sum (of the non-NaN values) divided by the count. To see the difference between count and size, you could experiment with this code:

np.random.seed(2018)
t1 = pd.DataFrame(np.random.randint(4, size=(50,3)), columns=['label', 'Quantity sold', 'month'])
t1.loc[np.random.choice([True, False], len(t1)), 'Quantity sold'] = np.nan
t1.groupby(['label', 'month'])['Quantity sold'].agg(['sum', 'count', 'size', 'mean']) 

which yields

             sum  count  size      mean
label month                            
0     1      0.0      0     3       NaN
      2      6.0      2     2  3.000000
      3      0.0      0     1       NaN
1     0      3.0      2     5  1.500000
      1      0.0      0     1       NaN
      2      5.0      3     5  1.666667
      3      0.0      2     3  0.000000
2     0      7.0      3     5  2.333333
      1      4.0      4     8  1.000000
      2      5.0      2     3  2.500000
      3      5.0      2     3  2.500000
3     0      1.0      2     5  0.500000
      1      3.0      1     1  3.000000
      2      2.0      1     2  2.000000
      3      2.0      1     3  2.000000
like image 77
unutbu Avatar answered Oct 10 '22 03:10

unutbu