
Difference(s) between scipy.stats.linregress, numpy.polynomial.polynomial.polyfit and statsmodels.api.OLS

It seems all three functions can do simple linear regression, e.g.

scipy.stats.linregress(x, y)

numpy.polynomial.polynomial.polyfit(x, y, 1)

x = statsmodels.api.add_constant(x)
statsmodels.api.OLS(y, x).fit()

Is there any real difference between the three methods? I know that statsmodels is built on top of scipy, and that scipy depends on numpy for many things, so I expect they should not differ vastly, but the devil is always in the details.

More specifically, if we use the numpy method above, how do we get the p-value of the slope, which the other two methods report by default?

I am using them in Python 3, if that makes any difference.

MLister asked Jun 29 '15 22:06




1 Answer

The three are very different, but they overlap in the parameter estimation for the very simple case of a single explanatory variable.

By increasing generality:

scipy.stats.linregress only handles the case of a single explanatory variable with specialized code and calculates a few extra statistics.

numpy.polynomial.polynomial.polyfit estimates the regression for a polynomial of a single variable, but doesn't return much in terms of extra statistics.

statsmodels OLS is a generic linear model (OLS) estimation class. It doesn't prespecify what the explanatory variables are and can handle any multivariate array of explanatory variables, or formulas and pandas DataFrames. It not only returns the estimated parameters, but also a large set of result statistics and methods for statistical inference and prediction.

For completeness of options for estimating linear models in Python (outside of Bayesian analysis), we should also consider scikit-learn LinearRegression and similar linear models, which are useful for selecting among a large number of explanatory variables but do not provide the large number of results that statsmodels does.
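
On the p-value of the slope for the numpy fit: polyfit does not report one, but it can be derived from the residuals with the standard OLS formulas for a single regressor. The following is a sketch under that assumption, not anything numpy provides directly. The helper name slope_pvalue is hypothetical, and scipy.stats.t is used only for the t distribution, since numpy has no equivalent.

```python
import numpy as np
from numpy.polynomial import polynomial as P
from scipy import stats


def slope_pvalue(x, y):
    """Two-sided p-value for H0: slope == 0 in a straight-line fit.

    Hypothetical helper for illustration; uses the textbook OLS
    standard error of the slope with n - 2 degrees of freedom.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size

    # polyfit returns coefficients from low to high degree
    intercept, slope = P.polyfit(x, y, 1)

    # Residual variance estimate with n - 2 degrees of freedom
    residuals = y - (intercept + slope * x)
    dof = n - 2
    sigma2 = np.sum(residuals ** 2) / dof

    # Standard error of the slope and the resulting t statistic
    se_slope = np.sqrt(sigma2 / np.sum((x - x.mean()) ** 2))
    t_stat = slope / se_slope

    # Two-sided p-value from the t distribution
    p_value = 2.0 * stats.t.sf(abs(t_stat), dof)
    return slope, se_slope, p_value


# Small deterministic example (data made up for this sketch)
x_demo = np.arange(10.0)
y_demo = 1.0 + 0.3 * x_demo + np.sin(x_demo)
slope, se_slope, p_value = slope_pvalue(x_demo, y_demo)
print(slope, se_slope, p_value)
```

For a single explanatory variable this should match the pvalue and stderr fields returned by scipy.stats.linregress, so in practice it is usually simpler to call that function or statsmodels directly.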

Josef answered Sep 21 '22 19:09
