Pandas dataframe groupby to calculate population standard deviation

Tags: python, pandas, numpy, statistics

I am trying to use groupby and np.std to calculate a standard deviation, but it seems to be calculating a sample standard deviation (delta degrees of freedom equal to 1, i.e. ddof=1).

Here is a sample.

#create dataframe
>>> df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'values':np.arange(10,30,5)})
>>> df
   A  B  values
0  1  1      10
1  1  2      15
2  2  1      20
3  2  2      25

#calculate standard deviation using groupby
>>> df.groupby('A').agg(np.std)
          B    values
A                    
1  0.707107  3.535534
2  0.707107  3.535534

#Calculate using numpy (np.std)
>>> np.std([10,15],ddof=0)
2.5
>>> np.std([10,15],ddof=1)
3.5355339059327378

Is there a way to use the population std calculation (ddof=0) with the groupby statement? The records I am using (not the example table above) are not samples, so I am only interested in population standard deviations.


asked Sep 18 '14 14:09

neelshiv

2 Answers

You can pass additional keyword arguments through agg to np.std, so ddof=0 gives the population standard deviation and ddof=1 the sample standard deviation:

In [202]:

df.groupby('A').agg(np.std, ddof=0)

Out[202]:
     B  values
A             
1  0.5     2.5
2  0.5     2.5

In [203]:

df.groupby('A').agg(np.std, ddof=1)

Out[203]:
          B    values
A                    
1  0.707107  3.535534
2  0.707107  3.535534
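
As a variant that does not depend on how agg forwards keyword arguments, the groupby object's own std method accepts ddof directly. The following is a minimal sketch, assuming a reasonably recent pandas; it rebuilds the frame from the question, and the names pop_std and pop_std_lambda are only illustrative.

```python
import numpy as np
import pandas as pd

# Same frame as in the question
df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [1, 2, 1, 2],
                   'values': np.arange(10, 30, 5)})

# GroupBy.std takes ddof itself; ddof=0 gives the population standard deviation
pop_std = df.groupby('A').std(ddof=0)

# Same calculation through agg, calling Series.std on each group's column
pop_std_lambda = df.groupby('A').agg(lambda s: s.std(ddof=0))

print(pop_std)
```

Both forms should reproduce the 0.5 / 2.5 values shown in Out[202] above.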

answered Sep 28 '22 18:09

EdChum

For a population standard deviation (ddof=0), wrap np.std in a named function; np.std uses ddof=0 by default.

(This means that groups containing a single value will end up with std=0 instead of NaN.)

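import numpy as np


def std(x):
    # np.std defaults to ddof=0, i.e. the population standard deviation
    return np.std(x, ddof=0)


# df is the dataframe from the question; the wrapper can sit next to built-in names
df.groupby('A').agg(['mean', 'max', std])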
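
To illustrate the note about single-value groups, here is a small sketch with a hypothetical frame (named single here, not from the question) in which group 2 has only one row. It compares the pandas default (ddof=1) with the np.std wrapper (ddof=0).

```python
import numpy as np
import pandas as pd

# Hypothetical frame: group 1 has two rows, group 2 has only one
single = pd.DataFrame({'A': [1, 1, 2], 'values': [10, 15, 20]})


def std(x):
    # np.std defaults to ddof=0 (population standard deviation)
    return np.std(x)


# pandas default is ddof=1, which is undefined for a single observation
sample = single.groupby('A')['values'].std()

# The wrapper is applied to each group as-is, so ddof=0 is used
population = single.groupby('A')['values'].agg(std)

print(sample)
print(population)
```

With ddof=1 the one-row group divides by n - 1 = 0 and comes out as NaN, while with ddof=0 it divides by n = 1 and comes out as 0.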

answered Sep 28 '22 19:09

Giorgos Myrianthous
