Different std in pandas vs numpy

The standard deviation computed by pandas differs from the one computed by numpy. Why, and which one is correct? The relative difference is 3.5%, which is too large to come from rounding.

Example

    import numpy as np
    import pandas as pd
    from StringIO import StringIO  # Python 2; on Python 3 use: from io import StringIO

    a = '''0.057411
    0.024367
    0.021247
    -0.001809
    -0.010874
    -0.035845
    0.001663
    0.043282
    0.004433
    -0.007242
    0.029294
    0.023699
    0.049654
    0.034422
    -0.005380'''

    df = pd.read_csv(StringIO(a.strip()), delim_whitespace=True, header=None)

    df.std() == np.std(df)  # False
    df.std()                # 0.025801
    np.std(df)              # 0.024926

    (0.024926 - 0.025801) / 0.024926  # 3.5% relative difference

I use these versions:

    pandas '0.14.0'
    numpy '1.8.1'
Mannaggia asked Jul 27 '14 18:07




2 Answers

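In a nutshell, neither is "incorrect". Pandas divides by N-1 by default (ddof=1, the sample standard deviation, based on the unbiased variance estimator), whereas NumPy divides by N by default (ddof=0, the population standard deviation).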
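Written out, the two defaults differ only in the denominator. The sketch below assumes the 15 values in the question form a single column (N = 15); the subscripts are just labels for each library's default, not official notation.

```latex
s_{\text{numpy}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^2},
\qquad
s_{\text{pandas}} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})^2}

\frac{s_{\text{pandas}}}{s_{\text{numpy}}}
  = \sqrt{\frac{N}{N-1}}
  = \sqrt{\frac{15}{14}}
  \approx 1.0351
```

That ratio accounts for the roughly 3.5% gap reported in the question, so the difference comes from the definition rather than from rounding.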

To make them behave the same, pass ddof=1 to numpy.std().
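A minimal sketch of this, using the values from the question as a plain list (the variable names are only illustrative, and a reasonably recent pandas and NumPy are assumed):

```python
import numpy as np
import pandas as pd

# The 15 values from the question, treated as a single column.
values = [0.057411, 0.024367, 0.021247, -0.001809, -0.010874,
          -0.035845, 0.001663, 0.043282, 0.004433, -0.007242,
          0.029294, 0.023699, 0.049654, 0.034422, -0.005380]

pandas_std = pd.Series(values).std()    # pandas default: ddof=1 (divide by N-1)
numpy_default = np.std(values)          # NumPy default: ddof=0 (divide by N)
numpy_sample = np.std(values, ddof=1)   # NumPy told to divide by N-1

print(pandas_std, numpy_default, numpy_sample)
print(np.isclose(pandas_std, numpy_sample))
```

With ddof=1, the NumPy result should agree with the pandas value from the question (about 0.025801) up to floating-point error.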

For further discussion, see

  • Can someone explain biased/unbiased population/sample standard deviation?
  • Population variance and sample variance.
  • Why divide by n-1?
NPE answered Sep 23 '22 11:09



To make pandas give the same result as numpy, pass the ddof=0 parameter: df.std(ddof=0).
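Going in this direction might look like the sketch below. The column name "x" is a hypothetical label chosen for the example, and `to_numpy()` assumes pandas 0.24 or later.

```python
import numpy as np
import pandas as pd

# Same 15 values as in the question, in a one-column DataFrame.
# The column name "x" is made up for this example.
df = pd.DataFrame({"x": [0.057411, 0.024367, 0.021247, -0.001809, -0.010874,
                         -0.035845, 0.001663, 0.043282, 0.004433, -0.007242,
                         0.029294, 0.023699, 0.049654, 0.034422, -0.005380]})

pandas_population = df.std(ddof=0)["x"]      # pandas told to divide by N
numpy_population = np.std(df["x"].to_numpy())  # NumPy default already divides by N

print(pandas_population, numpy_population)
print(np.isclose(pandas_population, numpy_population))
```

Passing the underlying array to np.std, rather than the DataFrame itself, keeps the comparison independent of how NumPy dispatches to pandas objects.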

This short video explains quite well why n-1 might be preferred for samples. https://www.youtube.com/watch?v=Cn0skMJ2F3c

Xuan answered Sep 21 '22 11:09
