
Why is pandas.Series.std() different from numpy.std()?

This is what I am trying to explain:

>>> pd.Series([7,20,22,22]).std()
7.2284161474004804
>>> np.std([7,20,22,22])
6.2599920127744575

Answer: this is explained by Bessel's correction. Pandas divides by N-1 in the standard deviation formula by default (ddof=1), while numpy divides by N (ddof=0). I wish Pandas used the same convention as numpy.
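To make the two conventions concrete, here is a minimal sketch that computes both by hand for the four values above, using only the standard library. The helper name `std_with_ddof` is hypothetical and introduced only for illustration; its `ddof` argument mirrors the parameter of the same name in pandas and numpy.

```python
import math


def std_with_ddof(values, ddof):
    """Standard deviation with N - ddof in the denominator (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - ddof))


prices = [7, 20, 22, 22]
population_std = std_with_ddof(prices, ddof=0)  # divide by N, numpy's default
sample_std = std_with_ddof(prices, ddof=1)      # divide by N - 1, pandas' default

print(population_std)  # approximately 6.26, the value np.std reports
print(sample_std)      # approximately 7.23, the value pd.Series.std reports
```

The sum of squared deviations is 156.75 in both cases; only the denominator (4 versus 3) differs.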


There is a related discussion here, but their suggestions do not work either.

I have data about many different restaurants. Here is my dataframe (imagine more than one restaurant, but the effect is reproduced with just one):

>>> df
    restaurant_id  price
id                      
1           10407      7
3           10407     20
6           10407     22
13          10407     22

Question: r.mi.groupby('restaurant_id')['price'].mean() returns price means for each restaurant. I want to get the standard deviations. However, r.mi.groupby('restaurant_id')['price'].std() returns wrong values.

As you can see, for simplicity I have extracted just one restaurant with four items. I want to find the standard deviation of the price. Just to make sure:

>>> np.mean([7,20,22,22])
17.75
>>> np.std([7,20,22,22])
6.2599920127744575

We can get the same (correct) values with

>>> np.mean(df)
restaurant_id    10407.00
price               17.75
dtype: float64
>>> np.std(df)
restaurant_id    0.000000
price            6.259992
dtype: float64

(Of course, disregard the mean restaurant id.) Obviously, np.std(df) is not a solution when I have more than one restaurant. So I am using groupby.

>>> df.groupby('restaurant_id').agg('std')
                  price
restaurant_id          
10407          7.228416

What?! 7.228416 is not 6.259992.

Let's try again.

>>> df.groupby('restaurant_id').std()

Same thing.

>>> df.groupby('restaurant_id')['price'].std()

Same thing.

>>> df.groupby('restaurant_id').apply(lambda x: x.std())

Same thing.

However, this works:

for restaurant_id, group in df.groupby('restaurant_id'):
    print(restaurant_id, np.std(group['price']))

Question: is there a proper way to aggregate the dataframe, so that I get a new Series with the standard deviation for each restaurant?

Sergey Orshanskiy asked Sep 06 '14 01:09



1 Answer

I see. Pandas is using Bessel's correction by default -- that is, the standard deviation formula with N-1 instead of N in the denominator. As behzad.nouri has pointed out in the comments,

pd.Series([7,20,22,22]).std(ddof=0)==np.std([7,20,22,22])
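Applied to the groupby from the question, the same `ddof=0` argument should give the per-restaurant values numpy reports. This is a minimal sketch, assuming the single-restaurant frame shown in the question; the variable names `population_std` and `via_numpy` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the single-restaurant frame from the question.
df = pd.DataFrame(
    {"restaurant_id": [10407] * 4, "price": [7, 20, 22, 22]},
    index=pd.Index([1, 3, 6, 13], name="id"),
)

# ddof=0 divides by N, matching np.std's default.
population_std = df.groupby("restaurant_id")["price"].std(ddof=0)
print(population_std)

# Alternative: call np.std on each group's underlying array.
via_numpy = df.groupby("restaurant_id")["price"].apply(
    lambda s: np.std(s.to_numpy())
)

# Compare floats with a tolerance rather than ==.
print(np.isclose(population_std.loc[10407], via_numpy.loc[10407]))
```

Both results are Series indexed by `restaurant_id`, which is the shape the question asks for.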
Sergey Orshanskiy answered Sep 17 '22 12:09