I am running a regression as follows (df is a pandas dataframe):
import statsmodels.api as sm
est = sm.OLS(df['p'], df[['e', 'varA', 'meanM', 'varM', 'covAM']]).fit()
est.summary()
This gave me, among other statistics, an R-squared of 0.942. So then I wanted to plot the original y-values and the fitted values. For this, I sorted the original values:
import numpy as np
import matplotlib.pyplot as plt

orig = df['p'].values
fitted = est.fittedvalues.values
resid = est.resid.values
args = np.argsort(orig)

plt.plot(orig[args], 'bo')
plt.plot(orig[args] - resid[args], 'ro')   # equivalent to fitted[args]
plt.show()
This, however, gave me a graph where the values were completely off, nothing that would suggest an R-squared of 0.9. Therefore, I tried to calculate it manually:
yBar = df['p'].mean()
SSTot = df['p'].apply(lambda x: (x-yBar)**2).sum()
SSReg = ((est.fittedvalues - yBar)**2).sum()
1 - SSReg/SSTot
Out[79]: 0.2618159806908984
Am I doing something wrong? Or is there a reason why my computation is so far off from what statsmodels is getting? SSTot and SSReg have values of 48084 and 35495, respectively.
If you do not include an intercept (a constant explanatory variable) in your model, statsmodels computes R-squared based on the un-centred total sum of squares, i.e.
tss = (ys ** 2).sum() # un-centred total sum of squares
as opposed to
tss = ((ys - ys.mean())**2).sum() # centred total sum of squares
As a result, R-squared comes out much higher.
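Here is a minimal, self-contained sketch of the difference (the data below is made up purely for illustration; only the no-constant fit mirrors the question):

import numpy as np
import statsmodels.api as sm

# made-up data with a clearly non-zero mean, so the two TSS definitions differ a lot
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(200, 2))
y = X @ np.array([1.5, -0.7]) + 3.0 + rng.normal(scale=1.0, size=200)

res = sm.OLS(y, X).fit()                   # no constant in the design, as in the question

ssr = (res.resid ** 2).sum()               # sum of squared residuals
tss_uncentred = (y ** 2).sum()             # denominator used when there is no constant
tss_centred = ((y - y.mean()) ** 2).sum()  # denominator used when a constant is included

print(res.rsquared)                        # equals 1 - ssr / tss_uncentred
print(1 - ssr / tss_uncentred)
print(1 - ssr / tss_centred)               # typically much lower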
This is mathematically correct, because R-squared should indicate how much of the variation is explained by the full model compared to the reduced model. If you define your model as:
ys = beta1 . xs + beta0 + noise
then the reduced model can be ys = beta0 + noise, where the estimate for beta0 is the sample average ys.mean(), and thus noise = ys - ys.mean(). That is where the de-meaning comes from in a model with an intercept.
But from a model like:
ys = beta . xs + noise
you may only reduce to ys = noise. Since noise is assumed to be zero-mean, you may not de-mean ys. Therefore, the unexplained variation in the reduced model is the un-centred total sum of squares.
This is documented here under the rsquared item. Set yBar equal to zero (and compute SSReg/SSTot rather than 1 - SSReg/SSTot, since SSReg is the explained sum of squares), and I would expect you to get the same number.
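A quick way to check this against the fit from the question (reusing est and df; because the fitted values and residuals of a least-squares fit are orthogonal, both expressions below give the same number):

SSTot0 = (df['p'] ** 2).sum()             # un-centred total sum of squares (yBar = 0)
SSReg0 = (est.fittedvalues ** 2).sum()    # un-centred explained sum of squares
ssr = (est.resid ** 2).sum()              # residual sum of squares

print(SSReg0 / SSTot0)                    # should reproduce est.rsquared (about 0.942)
print(1 - ssr / SSTot0)                   # statsmodels' own definition without a constant
print(est.rsquared)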
If your model is:
a = <yourmodel>.fit()
Then, to compute fitted values:
a.fittedvalues
and to compute R-squared:
a.rsquared
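For example, with the fit from the question (est plays the role of a here):

est.fittedvalues   # fitted values, indexed like df
est.rsquared       # the 0.942 reported by summary()
est.resid          # residuals, also used implicitly in the plot above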