 

Different results when computing linear regressions with scipy.stats and statsmodels

I'm getting different values of r^2 (the coefficient of determination) when I try OLS fits with these two libraries, and I can't quite figure out why. (Some spacing has been removed for convenience.)

In [1]: import pandas as pd       
In [2]: import numpy as np
In [3]: import statsmodels.api as sm
In [4]: import scipy.stats
In [5]: np.random.seed(100)
In [6]: x = np.linspace(0, 10, 100) + 5*np.random.randn(100)
In [7]: y = np.arange(100)

In [8]: slope, intercept, r, p, std_err = scipy.stats.linregress(x, y)

In [9]: r**2
Out[9]: 0.22045988449873671

In [10]: model = sm.OLS(y, x)
In [11]: est = model.fit()

In [12]: est.rsquared
Out[12]: 0.5327910685035413

What is going on here? I can't figure it out! Is there an error somewhere?

James asked Sep 30 '22

1 Answer

This is not an answer to the original question, which has already been answered: scipy.stats.linregress always fits an intercept, while sm.OLS includes one only if a constant column is added to the design matrix explicitly (e.g. with sm.add_constant).

About R-squared in a regression without a constant.

One problem is that the standard definition of R^2 does not apply to a regression without an intercept.

Essentially, R-squared as a goodness-of-fit measure in a model with an intercept compares the full model against the model that contains only an intercept, i.e. R^2 = 1 - SS_res / SS_tot with SS_tot = sum((y_i - mean(y))^2). If the full model does not have an intercept, applying this standard definition can produce strange results, such as a negative R^2.
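
To see how a negative R^2 can arise, here is a minimal sketch (with made-up data, not the data from the question): a line forced through the origin fits data centered far from the origin worse than the constant prediction mean(y) does, so the demeaned definition goes negative.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 10 + rng.normal(size=50)   # data far from the origin, unrelated to x

# OLS slope for a line forced through the origin: sum(x*y) / sum(x*x)
slope = (x @ y) / (x @ x)
ss_res = ((y - slope * x) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()   # standard, demeaned total sum of squares

print(1 - ss_res / ss_tot)   # large negative R^2: worse than predicting mean(y)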

The conventional definition in a regression without a constant instead divides by the total, uncentered sum of squares of the dependent variable, i.e. R^2 = 1 - SS_res / sum(y_i^2). Because the two definitions use different denominators, the R^2 of a regression with a constant and one without cannot really be compared in a meaningful way.
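
Using the numbers from the question, here is a sketch of both definitions (assuming reasonably recent scipy and statsmodels versions):

import numpy as np
import scipy.stats
import statsmodels.api as sm

np.random.seed(100)
x = np.linspace(0, 10, 100) + 5 * np.random.randn(100)
y = np.arange(100)

# Without a constant, statsmodels reports the uncentered R^2:
# 1 - SS_res / sum(y_i^2)
est = sm.OLS(y, x).fit()
print(np.allclose(est.rsquared, 1 - est.ssr / np.sum(y**2)))  # True

# linregress always fits an intercept; adding a constant column to the
# statsmodels design matrix reproduces its r^2 exactly
res = scipy.stats.linregress(x, y)
est_c = sm.OLS(y, sm.add_constant(x)).fit()
print(np.allclose(est_c.rsquared, res.rvalue**2))  # True

So the discrepancy in the question comes down to the missing constant, not an error in either library.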

See, for example, the issue that triggered the change in statsmodels to handle R^2 "correctly" in the no-constant regression: https://github.com/statsmodels/statsmodels/issues/785

Josef answered Oct 05 '22