Why do `sklearn` and `statsmodels` implementations of OLS regression give different R^2?

I have accidentally noticed that OLS models implemented by sklearn and statsmodels yield different values of R^2 when not fitting an intercept. Otherwise they seem to work fine. Consider the following code:

import numpy as np
import sklearn
import sklearn.linear_model as sl
import statsmodels
import statsmodels.api as sm

np.random.seed(42)

# Simulated data: slope 2, intercept 4, unit-variance Gaussian noise
N = 1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)

sklearnIntercept = sl.LinearRegression(fit_intercept=True).fit(X, Y)
sklearnNoIntercept = sl.LinearRegression(fit_intercept=False).fit(X, Y)
statsmodelsIntercept = sm.OLS(Y, sm.add_constant(X))  # constant added explicitly
statsmodelsNoIntercept = sm.OLS(Y, X)

print(sklearnIntercept.score(X, Y), statsmodelsIntercept.fit().rsquared)
print(sklearnNoIntercept.score(X, Y), statsmodelsNoIntercept.fit().rsquared)

print(sklearn.__version__, statsmodels.__version__)

It prints:

0.78741906105 0.78741906105
-0.950825182861 0.783154483028
0.19.1 0.8.0

Where does the difference come from?

This question differs from Different Linear Regression Coefficients with statsmodels and sklearn: there, sklearn.linear_model.LinearRegression (with intercept) was fitted on an X prepared the same way as for statsmodels.api.OLS.

It also differs from Statsmodels: Calculate fitted values and R squared: this question is about a difference between two Python packages (statsmodels and scikit-learn), while the linked one is about statsmodels versus the common R^2 definition. Both happen to be resolved by the same answer, but whether that makes them duplicates has already been discussed here: Does the same answer imply that the questions should be closed as duplicate?

asked Feb 16 '18 by abukaj


1 Answer

As pointed out by @user333700 in the comments, the definition of R^2 used by statsmodels' OLS implementation differs from scikit-learn's.

From the documentation of the RegressionResults class (emphasis mine):

rsquared

R-squared of a model with an intercept. This is defined here as 1 - ssr/centered_tss if the constant is included in the model and 1 - ssr/uncentered_tss if the constant is omitted.
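
To see the no-constant definition in action, here is a minimal sketch (reusing N, X, and Y from the question; ssr and uncentered_tss are my own names for the quantities in the docstring) that reproduces fit.rsquared by hand:

import numpy as np
import statsmodels.api as sm

np.random.seed(42)
N = 1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)

fit = sm.OLS(Y, X).fit()                 # constant omitted from the design matrix
ssr = ((Y - fit.predict(X)) ** 2).sum()  # residual sum of squares
uncentered_tss = (Y ** 2).sum()          # sum of squares around zero, not around Y.mean()

# With no constant, rsquared uses the uncentered total sum of squares:
print(1 - ssr / uncentered_tss, fit.rsquared)  # both ~0.7832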

From the documentation of LinearRegression.score():

score(X, y, sample_weight=None)

Returns the coefficient of determination R^2 of the prediction.

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
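
Reproducing that formula by hand for the no-intercept model shows where the negative score comes from; note that v is centered around Y.mean() regardless of fit_intercept (a minimal sketch: u and v follow the docstring, while X, Y, and sklearnNoIntercept come from the question's code):

y_pred = sklearnNoIntercept.predict(X)  # model fitted with fit_intercept=False
u = ((Y - y_pred) ** 2).sum()           # residual sum of squares
v = ((Y - Y.mean()) ** 2).sum()         # total sum of squares, always centered

print(1 - u / v, sklearnNoIntercept.score(X, Y))  # both ~-0.9508

So the discrepancy is entirely in the total sum of squares: without an intercept, statsmodels compares against the uncentered sum (Y ** 2).sum(), while scikit-learn always compares against the centered sum, which is why its score can drop below zero for a no-intercept model.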

answered Sep 21 '22 by abukaj