I have two code snippets, as follows.
import numpy
numpy.std([766897346, 766897346, 766897346, 766897346, 766897346, 766897346, 766897346, 766897346, 766897346, 766897346])
0
and
import pandas as pd
pd.Series([766897346, 766897346, 766897346, 766897346, 766897346, 766897346, 766897346, 766897346, 766897346, 766897346]).std(ddof=0)
10.119288512538814
That's a huge difference.
May I ask why?
Series as a generalized NumPy array: the essential difference is the presence of the index. While a NumPy array has an implicitly defined integer index used to access the values, a pandas Series has an explicitly defined index associated with the values.
Indexing a pandas Series is significantly slower than indexing a NumPy array. NumPy also tends to perform better when the number of rows is 50K or less.
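In pandas, Series.std() computes the standard deviation of the series and Series.mean() computes its mean, that is, the average of the values. Note that Series.std() defaults to ddof=1 (sample standard deviation), whereas numpy.std defaults to ddof=0, which is why the pandas call in the question passes ddof=0 to make the two results comparable.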
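To rule out the ddof defaults as the cause of a mismatch, a small check like the following can help. It is only a sketch; the sample data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
data = [1.0, 2.0, 3.0, 4.0]

# numpy.std defaults to ddof=0 (population standard deviation).
np_population = np.std(data)
# Series.std defaults to ddof=1 (sample standard deviation).
pd_sample = pd.Series(data).std()

# With matching ddof values the two libraries should agree closely.
np_sample = np.std(data, ddof=1)
pd_population = pd.Series(data).std(ddof=0)

print(np_population, pd_population)
print(np_sample, pd_sample)
```

If the two libraries still disagree once ddof matches, as in the question, the cause lies elsewhere.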
This issue is indeed already under discussion (link); the problem seems to be the algorithm pandas uses to calculate the standard deviation, since it is not as numerically stable as the one used by numpy.
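To illustrate what "numerically stable" means here, the sketch below compares two textbook ways of computing a population variance in floating point. The function names are hypothetical, and this is not claimed to be the exact code either library runs; it only shows how a one-pass "sum of squares" formula can lose precision on large, nearly identical values while a two-pass formula does not.

```python
def naive_variance(xs):
    """One-pass formula: (sum(x^2) - sum(x)^2 / n) / n.

    Subtracts two huge, nearly equal floats, so rounding error
    in the sums can dominate the result.
    """
    n = len(xs)
    total = 0.0
    total_sq = 0.0
    for x in xs:
        x = float(x)
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / n


def two_pass_variance(xs):
    """Two-pass formula: average squared deviation from the mean."""
    n = len(xs)
    mean = sum(float(x) for x in xs) / n
    return sum((float(x) - mean) ** 2 for x in xs) / n


data = [766897346] * 10

# The naive result depends on floating-point rounding and may be nonzero.
print(naive_variance(data))
# The two-pass result is exact here because every deviation is zero.
print(two_pass_variance(data))
```

The values in the question are around 7.7e8, so their squares are around 5.9e17, beyond the range where a 64-bit float can represent every integer exactly. That is where the rounding error in the one-pass formula comes from.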
An easy workaround would be to apply .values to the series first and then apply std to these values; in this case numpy's std is used:
pd.Series([766897346, 766897346, 766897346, 766897346, 766897346, 766897346, 766897346, 766897346, 766897346, 766897346]).values.std()
which gives you the expected value 0.
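The same workaround can be written with Series.to_numpy(), which the pandas documentation recommends over .values in recent versions. A minimal sketch, assuming a pandas version that provides to_numpy() (0.24 or later):

```python
import numpy as np
import pandas as pd

series = pd.Series([766897346] * 10)

# Convert to a plain NumPy array so that numpy's std is used.
result = np.std(series.to_numpy())
print(result)
```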