I have accidentally noticed that OLS models implemented by sklearn and statsmodels yield different values of R^2 when the intercept is not fitted. Otherwise they seem to work fine. The following code:
import numpy as np
import sklearn
import statsmodels
import sklearn.linear_model as sl
import statsmodels.api as sm

# synthetic data: Y = 2*X + 4 + noise
np.random.seed(42)
N = 1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)

# fit with and without an intercept in both libraries
sklearnIntercept = sl.LinearRegression(fit_intercept=True).fit(X, Y)
sklearnNoIntercept = sl.LinearRegression(fit_intercept=False).fit(X, Y)
statsmodelsIntercept = sm.OLS(Y, sm.add_constant(X))
statsmodelsNoIntercept = sm.OLS(Y, X)

print(sklearnIntercept.score(X, Y), statsmodelsIntercept.fit().rsquared)
print(sklearnNoIntercept.score(X, Y), statsmodelsNoIntercept.fit().rsquared)
print(sklearn.__version__, statsmodels.__version__)
prints:
0.78741906105 0.78741906105
-0.950825182861 0.783154483028
0.19.1 0.8.0
Where does the difference come from?
This question differs from Different Linear Regression Coefficients with statsmodels and sklearn: there, sklearn.linear_model.LinearRegression (with intercept) was fit on an X prepared as for statsmodels.api.OLS.
It also differs from Statsmodels: Calculate fitted values and R squared: this question addresses a difference between two Python packages (statsmodels and scikit-learn), while the linked question is about statsmodels and the common R^2 definition. Both happen to be answered by the same answer, but whether that makes them duplicates has already been discussed here: Does the same answer imply that the questions should be closed as duplicate?
A key difference between the two libraries is how they handle the constant term. scikit-learn's LinearRegression (which fits an ordinary least squares model) lets you request an intercept through the fit_intercept parameter, while statsmodels' OLS() expects you to add a constant column to the design matrix yourself, via sm.add_constant(). The constant accounts for the bias in the data (a constant offset present in all observations). A short sketch of the two conventions follows below.
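For illustration, a minimal sketch of the two conventions, reusing X and Y from the question's snippet; both fits recover the same coefficients:

import sklearn.linear_model as sl
import statsmodels.api as sm

# scikit-learn: the intercept is requested via a constructor parameter
sk_fit = sl.LinearRegression(fit_intercept=True).fit(X, Y)

# statsmodels: the intercept is an explicit column of ones added to X
sm_fit = sm.OLS(Y, sm.add_constant(X)).fit()

print(sk_fit.intercept_, sk_fit.coef_)  # roughly 4 and [2]
print(sm_fit.params)                    # roughly [4, 2], constant first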
To your other two points: linear regression is in its basic form the same in statsmodels and in scikit-learn. However, the implementations differ, which might produce different results in edge cases, and scikit-learn has in general more support for larger models. For example, statsmodels currently uses sparse matrices in very few parts.
When you need a variety of linear regression models, mixed linear models, regression with discrete dependent variables, and more, statsmodels has options. It also has a syntax much closer to R, so for those transitioning to Python, statsmodels is a good choice.
As pointed out by @user333700 in the comments, the definition of R^2 differs between statsmodels' OLS implementation and scikit-learn's.
From the documentation of the RegressionResults class (emphasis mine):
rsquared
R-squared of a model with an intercept. This is defined here as 1 - ssr/centered_tss if the constant is included in the model and 1 - ssr/uncentered_tss if the constant is omitted.
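To make that formula concrete, here is a minimal check reusing Y and statsmodelsNoIntercept from the question's snippet; with no constant, statsmodels uses the uncentered total sum of squares:

res = statsmodelsNoIntercept.fit()
ssr = (res.resid ** 2).sum()        # residual sum of squares
uncentered_tss = (Y ** 2).sum()     # total sum of squares about zero, no mean-centering
print(1 - ssr / uncentered_tss)     # ~0.7832, same as res.rsquared above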
From the documentation of LinearRegression.score():
score(X, y, sample_weight=None)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual
sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
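In other words, scikit-learn always centers the total sum of squares, even when fit_intercept=False, which is why the score can go negative. A sketch reproducing the negative value, reusing X, Y, and sklearnNoIntercept from the question's snippet:

y_pred = sklearnNoIntercept.predict(X)
u = ((Y - y_pred) ** 2).sum()       # residual sum of squares
v = ((Y - Y.mean()) ** 2).sum()     # centered total sum of squares
print(1 - u / v)                    # ~-0.9508, same as sklearnNoIntercept.score(X, Y)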