While working on some statistical analysis tools, I discovered there are at least three Python methods to calculate mean and standard deviation (not counting the "roll your own" techniques):

- np.mean(), np.std() (with ddof=0 or 1)
- statistics.mean(), statistics.pstdev() (and/or statistics.stdev())
- the scipy.stats package

That has me scratching my head. There should be one obvious way to do it, right? :-)

I've found some older SO posts. One compares the performance advantages of np.mean() vs statistics.mean(), and also highlights differences in the sum operator. That post is here: why-is-statistics-mean-so-slow
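For concreteness, here is a minimal side-by-side of the three routes on the same data (a sketch; the scipy calls shown, stats.tmean and stats.tstd, are my assumption of what "the scipy.stats way" looks like, and tstd defaults to ddof=1):

```python
import statistics

import numpy as np
from scipy import stats

data = [0.2, 0.5, 0.7, 0.1, 0.9]
arr = np.array(data)

# numpy: std() defaults to the population form (ddof=0)
print(np.mean(arr), np.std(arr), np.std(arr, ddof=1))

# statistics: pstdev() is the population form, stdev() the sample form
print(statistics.mean(data), statistics.pstdev(data), statistics.stdev(data))

# scipy.stats: tmean()/tstd() with no limits reduce to a plain mean/std; tstd() uses ddof=1
print(stats.tmean(arr), stats.tstd(arr))
```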
I am working with numpy array data, and my values fall in a small range (-1.0 to 1.0, or 0.0 to 10.0), so the numpy functions seem the obvious answer for my application. They have a good balance of speed, accuracy, and ease of implementation for the data I will be processing.
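A quick timeit harness along these lines is how I would check the speed claim on representative data (a sketch; the array size and value range are arbitrary, and no timings are asserted here):

```python
import statistics
import timeit

import numpy as np

# Representative small-range data, as a numpy array and as a plain list
arr = np.random.default_rng(0).uniform(-1.0, 1.0, 10_000)
lst = arr.tolist()

# Compare the array-based and list-based routes on the same values
print("np.mean         :", timeit.timeit(lambda: np.mean(arr), number=1000))
print("statistics.mean :", timeit.timeit(lambda: statistics.mean(lst), number=1000))
```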
It appears the statistics module is primarily for those who have data in lists (or other forms), or for widely varying ranges such as [1e+5, 1.0, 1e-5]. Is that still a fair statement? Are there any numpy enhancements that address the differences in the sum operator? Do recent developments bring any other advantages?
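To make the sum-operator point concrete, this is the kind of case I have in mind (a sketch; the exact outputs assume IEEE float64, where statistics.mean sums exactly via rational arithmetic while np.mean accumulates in floats):

```python
import statistics

import numpy as np

# Widely varying magnitudes: the true mean is 1/3
data = [1e16, 1.0, -1e16]

print(np.mean(np.array(data)))   # float accumulation can lose the 1.0 entirely
print(statistics.mean(data))     # exact rational arithmetic recovers 0.333...
```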
Numerical algorithms generally involve trade-offs: some are faster, some are more accurate, and some require a smaller memory footprint. When faced with three or four ways to do a calculation, it is the developer's responsibility to select the "best" method for their application, which is generally a balancing act between competing priorities and resources.
My intent is to solicit replies from programmers experienced in statistical analysis to provide insights into the strengths and weaknesses of the methods above (or other/better methods). [I'm not interested in speculation or opinions without supporting facts.] I will make my own decision based on my design requirements.
Why does NumPy duplicate features of SciPy?
From the SciPy FAQ What is the difference between NumPy and SciPy?:
In an ideal world, NumPy would contain nothing but the array data type and the most basic operations: indexing, sorting, reshaping, basic elementwise functions, etc. All numerical code would reside in SciPy. However, one of NumPy’s important goals is compatibility, so NumPy tries to retain all features supported by either of its predecessors.
It recommends using SciPy over NumPy:
In any case, SciPy contains more fully-featured versions of the linear algebra modules, as well as many other numerical algorithms. If you are doing scientific computing with Python, you should probably install both NumPy and SciPy. Most new features belong in SciPy rather than NumPy.
When should I use the statistics library?
From the statistics library documentation:
The module is not intended to be a competitor to third-party libraries such as NumPy, SciPy, or proprietary full-featured statistics packages aimed at professional statisticians such as Minitab, SAS and Matlab. It is aimed at the level of graphing and scientific calculators.
Thus I would not use it for serious (i.e. resource-intensive) computation.
What is the difference between statsmodels and SciPy?
From the statsmodels about page:
The models module of scipy.stats was originally written by Jonathan Taylor. For some time it was part of scipy but was later removed. During the Google Summer of Code 2009, statsmodels was corrected, tested, improved and released as a new package. Since then, the statsmodels development team has continued to add new models, plotting tools, and statistical methods.
Thus you may have a requirement that SciPy is not able to fulfill, or is better fulfilled by a dedicated library.
For example, the SciPy documentation for scipy.stats.probplot notes that
Statsmodels has more extensive functionality of this type, see statsmodels.api.ProbPlot.
Thus in cases like these you will need to turn to statistical libraries beyond SciPy.
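As a rough illustration of that hand-off (a sketch; both calls exist in current SciPy and statsmodels, but check the respective docs for the options you actually need):

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# SciPy: returns the ordered quantile pairs plus a least-squares fit
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist="norm")

# statsmodels: ProbPlot wraps the same idea with more variants (qqplot, ppplot, probplot)
pp = sm.ProbPlot(x)
fig = pp.qqplot(line="45")
```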