I am working with sklearn and specifically the linear_model module. After fitting a simple linear as in
import pandas as pd
import numpy as np
from sklearn import linear_model
randn = np.random.randn
X = pd.DataFrame(randn(10,3), columns=['X1','X2','X3'])
y = pd.DataFrame(randn(10,1), columns=['Y'])
model = linear_model.LinearRegression()
model.fit(X=X, y=y)
I see how I can access to coefficients and intercept via coef_ and intercept_, prediction is straightforward as well. I would like to access a variance-covariance matrix for the parameters of this simple model, and the standard error of these parameters. I am familiar with R and the vcov() function, and it seems that scipy.optimize has some functionality for this (Getting standard errors on fitted parameters using the optimize.leastsq method in python) - does sklearn have any functionality for accessing these statistics??
Appreciate any help on this.
-Ryan
Linear Regression Theory Linear regression performs the task to predict a dependent variable value (y) based on a given independent variable (x). So, this regression technique finds out a linear relationship between x (input) and y(output). Hence, the name is Linear Regression.
The standard error of the regression (S), also known as the standard error of the estimate, represents the average distance that the observed values fall from the regression line. Conveniently, it tells you how wrong the regression model is on average using the units of the response variable.
Standard error of the regression = (SQRT(1 minus adjusted-R-squared)) x STDEV. S(Y). So, for models fitted to the same sample of the same dependent variable, adjusted R-squared always goes up when the standard error of the regression goes down.
not with scikit-learn, but you can compute this manually with some linear algebra. i do this for your example below.
also here's a jupyter notebook with this code: https://gist.github.com/grisaitis/cf481034bb413a14d3ea851dab201d31
the standard errors of your estimates are just the square root of the variances of your estimates. what's the variance of your estimate? if you assume your model has gaussian error, it's:
Var(beta_hat) = inverse(X.T @ X) * sigma_squared_hat
and then the standard error of beta_hat[i]
is Var(beta_hat)[i, i] ** 0.5
.
All you have to compute sigma_squared_hat
. This is the estimate of your model's gaussian error. This is not known a priori but can be estimated with the sample variance of your residuals.
Also you need to add an intercept term to your data matrix. Scikit-learn does this automatically with the LinearRegression
class. So to compute this yourself you need to add that to your X matrix or dataframe.
Starting after your code,
print(model.intercept_)
print(model.coef_)
[-0.28671532]
[[ 0.17501115 -0.6928708 0.22336584]]
N = len(X)
p = len(X.columns) + 1 # plus one because LinearRegression adds an intercept term
X_with_intercept = np.empty(shape=(N, p), dtype=np.float)
X_with_intercept[:, 0] = 1
X_with_intercept[:, 1:p] = X.values
beta_hat = np.linalg.inv(X_with_intercept.T @ X_with_intercept) @ X_with_intercept.T @ y.values
print(beta_hat)
[[-0.28671532]
[ 0.17501115]
[-0.6928708 ]
[ 0.22336584]]
y_hat = model.predict(X)
residuals = y.values - y_hat
residual_sum_of_squares = residuals.T @ residuals
sigma_squared_hat = residual_sum_of_squares[0, 0] / (N - p)
var_beta_hat = np.linalg.inv(X_with_intercept.T @ X_with_intercept) * sigma_squared_hat
for p_ in range(p):
standard_error = var_beta_hat[p_, p_] ** 0.5
print(f"SE(beta_hat[{p_}]): {standard_error}")
SE(beta_hat[0]): 0.2468580488280805
SE(beta_hat[1]): 0.2965501221823944
SE(beta_hat[2]): 0.3518847753610169
SE(beta_hat[3]): 0.3250760291745124
statsmodels
import statsmodels.api as sm
ols = sm.OLS(y.values, X_with_intercept)
ols_result = ols.fit()
ols_result.summary()
...
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -0.2867 0.247 -1.161 0.290 -0.891 0.317
x1 0.1750 0.297 0.590 0.577 -0.551 0.901
x2 -0.6929 0.352 -1.969 0.096 -1.554 0.168
x3 0.2234 0.325 0.687 0.518 -0.572 1.019
==============================================================================
yay, done!
No, scikit-learn does not have built error estimates for doing inference. Statsmodels does though.
import statsmodels.api as sm
ols = sm.OLS(y, X)
ols_result = ols.fit()
# Now you have at your disposition several error estimates, e.g.
ols_result.HC0_se
# and covariance estimates
ols_result.cov_HC0
see docs
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With