Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas dataframe groupby to calculate population standard deviation

I am trying to use groupby and np.std to calculate a standard deviation, but it seems to be calculating a sample standard deviation (with a degrees of freedom equal to 1).

Here is a sample.

#create dataframe
>>> df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'values':np.arange(10,30,5)})
>>> df
   A  B  values
0  1  1      10
1  1  2      15
2  2  1      20
3  2  2      25

#calculate standard deviation using groupby
>>> df.groupby('A').agg(np.std)
      B    values
A                    
1  0.707107  3.535534
2  0.707107  3.535534

#Calculate using numpy (np.std)
>>> np.std([10,15],ddof=0)
2.5
>>> np.std([10,15],ddof=1)
3.5355339059327378

Is there a way to use the population std calculation (ddof=0) with the groupby statement? The records I am using are not (not the example table above) are not samples, so I am only interested in population std deviations.

like image 746
neelshiv Avatar asked Sep 18 '14 14:09

neelshiv


People also ask

How do you find the standard deviation in Groupby pandas?

Pandas Groupby Standard Deviation The following is a step-by-step guide of what you need to do. Group the dataframe on the column(s) you want. Select the field(s) for which you want to estimate the standard deviation. Apply the pandas std() function directly or pass 'std' to the agg() function.

How do you find the standard deviation of grouped data in Python?

Coding a stdev() Function in Python sqrt() to take the square root of the variance. With this new implementation, we can use ddof=0 to calculate the standard deviation of a population, or we can use ddof=1 to estimate the standard deviation of a population using a sample of data.

What is possible using Groupby () method of pandas?

groupby() function is used to split the data into groups based on some criteria. pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. sort : Sort group keys.


2 Answers

You can pass additional args to np.std in the agg function:

In [202]:

df.groupby('A').agg(np.std, ddof=0)

Out[202]:
     B  values
A             
1  0.5     2.5
2  0.5     2.5

In [203]:

df.groupby('A').agg(np.std, ddof=1)

Out[203]:
          B    values
A                    
1  0.707107  3.535534
2  0.707107  3.535534
like image 77
EdChum Avatar answered Sep 28 '22 18:09

EdChum


For degree of freedom = 0

(This means that bins with one number will end up with std=0 instead of NaN)

import numpy as np


def std(x): 
    return np.std(x)


df.groupby('A').agg(['mean', 'max', std])
like image 36
Giorgos Myrianthous Avatar answered Sep 28 '22 19:09

Giorgos Myrianthous