Why is pandas.Series.std() different from numpy.std()?

Tags:

python

pandas

group-by

numpy

statistics

This is what I am trying to explain:

>>> pd.Series([7,20,22,22]).std()
7.2284161474004804
>>> np.std([7,20,22,22])
6.2599920127744575

Answer: this is explained by Bessel's correction, N-1 instead of N in the denominator of the standard deviation formula. I wish Pandas used the same convention as numpy.
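To make the two conventions concrete, here is a small sketch that computes both by hand using only the standard library. The helper name std_with_ddof is made up for this illustration; it is not a pandas or numpy function.

```python
import math

def std_with_ddof(values, ddof):
    """Standard deviation with N - ddof in the denominator (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - ddof))

prices = [7, 20, 22, 22]
population_std = std_with_ddof(prices, ddof=0)  # divides by N, like np.std's default
sample_std = std_with_ddof(prices, ddof=1)      # divides by N - 1, like pandas' default
print(population_std, sample_std)
```

The sum of squared deviations is 156.75 in both cases; dividing by 4 gives roughly 6.26 and dividing by 3 gives roughly 7.23, which matches the two outputs above.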

There is a related discussion here, but their suggestions do not work either.

I have data about many different restaurants. Here is my dataframe (imagine more than one restaurant, but the effect is reproduced with just one):

>>> df
    restaurant_id  price
id                      
1           10407      7
3           10407     20
6           10407     22
13          10407     22

Question: df.groupby('restaurant_id')['price'].mean() returns the mean price for each restaurant. I want to get the standard deviations. However, df.groupby('restaurant_id')['price'].std() returns values that do not match np.std.

As you can see, for simplicity I have extracted just one restaurant with four items. I want to find the standard deviation of the price. Just to make sure:

>>> np.mean([7,20,22,22])
17.75
>>> np.std([7,20,22,22])
6.2599920127744575

We can get the same (correct) values with

>>> np.mean(df)
restaurant_id    10407.00
price               17.75
dtype: float64
>>> np.std(df)
restaurant_id    0.000000
price            6.259992
dtype: float64

(Of course, disregard the mean restaurant id.) Obviously, np.std(df) is not a solution when I have more than one restaurant. So I am using groupby.

>>> df.groupby('restaurant_id').agg('std')
                  price
restaurant_id          
10407          7.228416

What?! 7.228416 is not 6.259992.

Let's try again.

>>> df.groupby('restaurant_id').std()

Same thing.

>>> df.groupby('restaurant_id')['price'].std()

Same thing.

>>> df.groupby('restaurant_id').apply(lambda x: x.std())

Same thing.

However, this works:

for restaurant_id, group in df.groupby('restaurant_id'):
    print(restaurant_id, np.std(group['price']))

Question: is there a proper way to aggregate the dataframe, so that I get a new Series with the standard deviation for each restaurant?

739

asked Sep 06 '14 01:09

Sergey Orshanskiy

1 Answer

I see. Pandas uses Bessel's correction by default: its standard deviation formula has N-1 instead of N in the denominator (ddof=1), whereas numpy's np.std divides by N (ddof=0). As behzad.nouri has pointed out in the comments, passing ddof=0 to pandas makes the two agree:

pd.Series([7,20,22,22]).std(ddof=0)==np.std([7,20,22,22])
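The same ddof argument can be passed to the grouped std() call, which should answer the original aggregation question. The following is a minimal sketch, assuming a reasonably recent pandas; the frame is rebuilt by hand from the question, and the variable names per_restaurant_std and numpy_sample_std are only illustrative.

```python
import numpy as np
import pandas as pd

# The single-restaurant frame from the question, rebuilt by hand.
df = pd.DataFrame(
    {"restaurant_id": [10407, 10407, 10407, 10407], "price": [7, 20, 22, 22]},
    index=pd.Index([1, 3, 6, 13], name="id"),
)

# Population standard deviation (divide by N) for each restaurant.
per_restaurant_std = df.groupby("restaurant_id")["price"].std(ddof=0)

# Going the other way: ask numpy for the sample standard deviation (divide by N - 1).
numpy_sample_std = np.std(df["price"].to_numpy(), ddof=1)

print(per_restaurant_std)
print(numpy_sample_std)
```

The result of the grouped call is a Series indexed by restaurant_id, so it extends to many restaurants without an explicit loop. Which convention is appropriate depends on whether the prices are treated as the whole population or as a sample from it.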

108

answered Sep 17 '22 12:09

Sergey Orshanskiy
