I wanted to groupby by 'label' and 'month' to sum the Quantity sold for each month and for each label.
I am trying to do 'groupby and apply' method for achieving this, but not sure how to count the month for each label. Say, for label value AFFLELOU (DOS), I have two values for 7th month. so, I should sum the quantity sold and divide by 2. for 9th and 10th month, I just have one value, so the count would be 1 and it would divide the quantity sold.
I wrote the code below, but it doesn't take count as a function and return count not defined error.
t1.groupby(['label', 'month']).apply(lambda x: x['Quantity sold']
.sum()/count('month'))
Can someone tell me how to get the count value of each month for each label?
Thanks in Advance.
Use count() by Column Namegroupby() to group the rows by column and use count() method to get the count for each group by ignoring None and Nan values. It works with non-floating type data as well. The below example does the grouping on Courses column and calculates count how many times each value is present.
The most simple method for pandas groupby count is by using the in-built pandas method named size(). It returns a pandas series that possess the total number of row count for each group. The basic working of the size() method is the same as len() method and hence, it is not affected by NaN values in the dataset.
Instead of summing, counting and dividing, you could use agg('mean')
:
t1.groupby(['label', 'month'])['Quantity sold'].agg('mean')
Or, if you do wish to retain the sum and count, use:
t1.groupby(['label', 'month'])['Quantity sold'].agg(['sum', 'count', 'mean'])
For example,
import numpy as np
import pandas as pd
t1 = pd.DataFrame(np.random.randint(4, size=(20,3)), columns=['label', 'Quantity sold', 'month'])
t1.groupby(['label', 'month'])['Quantity sold'].agg(['sum', 'count', 'mean'])
yields a DataFrame like
sum count mean
label month
0 1 2 1 2.00
2 0 1 0.00
3 2 2 1.00
1 1 1 2 0.50
2 3 1 3.00
3 1 1 1.00
2 0 0 1 0.00
1 0 3 0.00
3 5 4 1.25
3 0 1 1 1.00
1 0 1 0.00
2 0 1 0.00
3 3 1 3.00
Using groupby/agg
with its builtin aggregators sum
, count
and mean
is clearly more convenient here, but if you did need to use groupby/apply
with a custom function you could use:
t1.groupby(['label', 'month']).apply(lambda x: x['Quantity sold'].sum()/len(x))
Note that while calling custom functions with groupby/apply
gives you more flexibility, it comes at a cost because calling a custom Python function once for each group is generally slower than calling the builtin Cythonized aggregators available in groupby/agg
.
If you have missing (NaN
) values in Quantity sold
, it may help to know that group/agg
has both 'count'
and 'size'
aggregators:
'count'
returns the number of non-NaN
values'size'
returns the length of the group (including NaN
values)The count
is always less than or equal to the size
. The mean
is the sum
(of the non-NaN
values) divided by the count
. To see the difference between count
and size
, you could experiment with this code:
np.random.seed(2018)
t1 = pd.DataFrame(np.random.randint(4, size=(50,3)), columns=['label', 'Quantity sold', 'month'])
t1.loc[np.random.choice([True, False], len(t1)), 'Quantity sold'] = np.nan
t1.groupby(['label', 'month'])['Quantity sold'].agg(['sum', 'count', 'size', 'mean'])
which yields
sum count size mean
label month
0 1 0.0 0 3 NaN
2 6.0 2 2 3.000000
3 0.0 0 1 NaN
1 0 3.0 2 5 1.500000
1 0.0 0 1 NaN
2 5.0 3 5 1.666667
3 0.0 2 3 0.000000
2 0 7.0 3 5 2.333333
1 4.0 4 8 1.000000
2 5.0 2 3 2.500000
3 5.0 2 3 2.500000
3 0 1.0 2 5 0.500000
1 3.0 1 1 3.000000
2 2.0 1 2 2.000000
3 2.0 1 3 2.000000
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With