 

Difference between Linear Regression Coefficients between Python and R

I'm trying to run a linear regression in Python that I have already done in R, in order to find variables with 0 coefficients. The issue I'm running into is that the linear regression in R returns NAs for columns with low variance, while the scikit-learn regression returns coefficients for them. In the R code I find and save these variables by taking the names with NA coefficients from the regression output, but I can't figure out a way to mimic this behavior in Python. The code I'm using can be found below.

R Code:

a <- c(23, 45, 546, 42, 68, 15, 47)
b <- c(1, 2, 4, 6, 34, 2, 8)
c <- c(22, 33, 44, 55, 66, 77, 88)
d <- c(1, 1, 1, 1, 1, 1, 1)
e <- c(1, 1, 1, 1, 1, 1, 1.1)
f <- c(1, 1, 1, 1, 1, 1, 1.01)
g <- c(1, 1, 1, 1, 1, 1, 1.001)

df <- data.frame(a, b, c, d, e, f, g)
var_list = c('b', 'c', 'd', 'e', 'f', 'g')

target <- df$a
reg_data <- cbind(target, df[, var_list])


if (nrow(reg_data) < length(var_list)){
  message(paste0('    WARNING: Data set is rank deficient. Result may be doubtful'))
}
reg_model <- lm(target ~ ., data = reg_data)

print(reg_model$coefficients)

#store the independent variables with 0 coefficients
zero_coef_IndepVars.v <- names(which(is.na(reg_model$coefficients)))

print(zero_coef_IndepVars.v)

Python Code:

import pandas as pd
from sklearn import linear_model

a = [23, 45, 546, 42, 68, 15, 47]
b = [1, 2, 4, 6, 34, 2, 8]
c = [22, 33, 44, 55, 66, 77, 88]
d = [1, 1, 1, 1, 1, 1, 1]
e = [1, 1, 1, 1, 1, 1, 1.1]
f = [1, 1, 1, 1, 1, 1, 1.01]
g = [1, 1, 1, 1, 1, 1, 1.001]


df = pd.DataFrame({'a': a,
                   'b': b,
                   'c': c,
                   'd': d,
                   'e': e,
                   'f': f,
                   'g': g})


var_list = ['b', 'c', 'd', 'e', 'f', 'g']

# build linear regression model and test for linear combination
target = df['a']
reg_data = pd.DataFrame()
reg_data['a'] = target
train_cols = df[var_list]


if reg_data.shape[0] < len(var_list):
    print('    WARNING: Data set is rank deficient. Result may be doubtful')

# Create linear regression object
reg_model = linear_model.LinearRegression()

# Train the model using the training sets
reg_model.fit(train_cols, reg_data['a'])

print(reg_model.coef_)

Output from R:

(Intercept)           b           c           d           e           f           g 
 537.555988   -0.669253   -1.054719          NA -356.715149          NA          NA 

> print(zero_coef_IndepVars.v)
[1] "d" "f" "g"

Output from Python:

           b             c   d               e              f            g
[-0.66925301   -1.05471932   0.   -353.1483504   -35.31483504   -3.5314835]

As you can see, the values for columns 'b', 'c', and 'e' are close, but very different for 'd', 'f', and 'g'. For this example regression, I would want to return ['d', 'f', 'g'] as their outputs are NA from R. The issue is that the sklearn linear regression returns 0 for col 'd', while it returns -35.31 for col 'f' and -3.531 for col 'g'.

Does anyone know how R decides whether to return NA or a value, and how to implement this behavior in the Python version? Knowing where the differences stem from will likely help me implement the R behavior in Python. I need the results of the Python script to match the R outputs exactly.

asked Apr 20 '17 by Nizag


2 Answers

It's a difference in implementation. lm in R uses underlying C code that is based on a QR decomposition: the model matrix is decomposed into an orthogonal matrix Q and a triangular matrix R. This gives what others have called "a check on collinearity", although R doesn't explicitly check anything; the nature of the (pivoted) QR decomposition ensures that the least collinear variables get "priority" in the fitting algorithm, and the redundant columns come back as NA.

More info on QR decomposition in the context of linear regression: https://www.stat.wisc.edu/~larget/math496/qr.html
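
To see how the pivoting plays out in Python, here is a minimal sketch of my own (not part of either answer's code) using scipy.linalg.qr with column pivoting, reusing df and var_list from the question; the tolerance and the column names are assumptions, and scipy's pivot order differs from the restricted pivoting inside lm:

import numpy as np
from scipy import linalg

# Design matrix with an explicit intercept, columns in model order
X = np.column_stack([np.ones(len(df)), df[var_list].values])
names = ['Intercept'] + var_list

# Column-pivoted QR: the factorization reorders columns so the most
# linearly independent ones are processed first
Q, R, piv = linalg.qr(X, pivoting=True)

# Numerical rank = number of non-negligible diagonal entries of R
tol = np.abs(R[0, 0]) * max(X.shape) * np.finfo(float).eps
rank = int(np.sum(np.abs(np.diag(R)) > tol))
print(rank)                           # 4: only 4 of the 7 columns carry information

# Columns pivoted past the numerical rank are the (nearly) redundant ones.
# scipy's greedy pivot order may flag 'Intercept' instead of 'd' (the two
# columns are identical); lm()'s restricted pivoting keeps earlier columns,
# which is why R reports d, f and g as NA.
print([names[i] for i in piv[rank:]])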

The code from sklearn is basically a wrapper around numpy.linalg.lstsq, which minimizes the squared Euclidean norm: if your model is Y = AX, it minimizes ||Y - AX||^2. For a rank-deficient design this returns a minimum-norm least-squares solution rather than dropping the redundant columns, so it doesn't have the nice side effect of the pivoted QR decomposition.
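
As a quick check (my own sketch, again reusing df and var_list from the question), calling the least-squares solver directly shows the behavior the asker observed: the solver notices the rank deficiency, but still hands back a coefficient for every column.

import numpy as np

# Solve the least-squares problem directly on a design matrix with intercept
X = np.column_stack([np.ones(len(df)), df[var_list].values])
y = df['a'].values

coef, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(rank)    # 4: the rank deficiency is detected...
print(coef)    # ...but a (minimum-norm) coefficient is returned for every column

The numbers may not match scikit-learn's output exactly, since LinearRegression centers the data and fits the intercept separately, but the point stands: nothing is reported as aliased.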

Personal note: if you want robust model fitting in a proven and tested computational framework and insist on using Python, look for linear regression implementations that are based on QR or SVD. The packages scikit-learn and statsmodels (still in beta as of 22 April 2017) should get you there.
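
If the goal is specifically to reproduce the set of variables that lm() reports as NA, one way (a sketch of my own, not something these packages provide directly) is to walk the columns in model order and keep only those that raise the numerical rank, mirroring lm()'s earlier-columns-win behavior:

import numpy as np

# Assumes df and var_list as defined in the question
X = np.column_stack([np.ones(len(df)), df[var_list].values])
names = ['Intercept'] + var_list

kept, aliased = [], []
for j, name in enumerate(names):
    candidate = X[:, kept + [j]]
    # Keep the column only if it increases the numerical rank, i.e. it is
    # not (numerically) a linear combination of the columns kept so far
    if np.linalg.matrix_rank(candidate) > len(kept):
        kept.append(j)
    else:
        aliased.append(name)

print(aliased)   # ['d', 'f', 'g'], the same set as zero_coef_IndepVars.v in R

Exact agreement with lm() is only guaranteed when the dependencies are as clear-cut as in this example; lm() applies its own tolerance inside the pivoted QR, so borderline cases can still differ.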

answered by Joris Meys


I guess there is not enough data. This is the result from statsmodels:

import statsmodels.formula.api as smf
lm = smf.ols(formula='a ~ b + c + d + e + f + g', data=df).fit()
lm.summary()

gives :

                            OLS Regression Results
==============================================================================
Dep. Variable:                      a   R-squared:                       0.038
Model:                            OLS   Adj. R-squared:                 -0.923
Method:                 Least Squares   F-statistic:                   0.03993
Date:                Fri, 21 Apr 2017   Prob (F-statistic):              0.987
Time:                        22:29:16   Log-Likelihood:                -46.059
No. Observations:                   7   AIC:                             100.1
Df Residuals:                       3   BIC:                             99.90
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|     [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    151.5350   1065.536      0.142      0.896     -3239.476  3542.545
b             -0.6693     10.324     -0.065      0.952       -33.526    32.188
c             -1.0547      6.412     -0.164      0.880       -21.462    19.352
d            151.5350   1065.536      0.142      0.896     -3239.476  3542.545
e           -368.1353   3862.592     -0.095      0.930     -1.27e+04  1.19e+04
f             99.5679    574.110      0.173      0.873     -1727.506  1926.642
g            146.3383   1016.341      0.144      0.895     -3088.111  3380.788
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.447
Prob(Omnibus):                    nan   Jarque-Bera (JB):                4.545
Skew:                           1.797   Prob(JB):                        0.103
Kurtosis:                       4.632   Cond. No.                     1.34e+18
==============================================================================

OLS gives several clues that this linear problem is ill-conditioned.
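
For completeness, a small sketch (my addition, assuming the fitted results object lm from the code above) that pulls the same clues out programmatically:

import numpy as np

X = lm.model.exog                   # design matrix, including the intercept
print(X.shape)                      # (7, 7): as many columns as observations
print(np.linalg.matrix_rank(X))     # 4: three columns are redundant
print(np.linalg.cond(X))            # enormous, of the same order as the Cond. No. above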

answered by cast42