
How to add interaction term in Python sklearn


Suppose I have independent variables [x1, x2, x3]. If I fit a linear regression in sklearn, it gives me something like this:

y = a*x1 + b*x2 + c*x3 + intercept

Polynomial regression with degree = 2 gives me something like

y = a*x1^2 + b*x1*x2 ......

I don't want second-degree terms like x1^2.

How can I get

y = a*x1 + b*x2 + c*x3 + d*x1*x2

if x1 and x2 have a correlation larger than some threshold value j?

Dylan asked Aug 23 '17 00:08



2 Answers

To generate polynomial features, I assume you are using sklearn.preprocessing.PolynomialFeatures.

Its constructor has an interaction_only argument that restricts the output to interaction terms. So you can write something like:

    from sklearn.preprocessing import PolynomialFeatures

    poly = PolynomialFeatures(interaction_only=True, include_bias=False)
    poly.fit_transform(X)

Now only the original features and their pairwise products are generated, and pure powers such as x1^2 are omitted. Your new feature space becomes [x1, x2, x3, x1*x2, x1*x3, x2*x3].

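You can fit your regression model on top of the transformed features (not the original X):

    from sklearn import linear_model

    X_inter = poly.fit_transform(X)  # original features plus pairwise products
    clf = linear_model.LinearRegression()
    clf.fit(X_inter, y)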

This makes your resulting equation y = a*x1 + b*x2 + c*x3 + d*x1*x2 + e*x1*x3 + f*x2*x3 + intercept

Note: with n features this adds n(n-1)/2 interaction terms, so in a high-dimensional feature space the number of columns grows quickly (curse of dimensionality), which can cause problems such as overfitting and high variance.
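One way to keep the number of columns down, and closer to what the question asks for, is to add a product column only for pairs whose correlation exceeds a threshold. This is just a sketch: add_correlated_interactions is a hypothetical helper (not part of sklearn or pandas), the data is synthetic, and using the absolute Pearson correlation is my assumption about what "high correlation" means here.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def add_correlated_interactions(df, threshold):
    """Hypothetical helper: return a copy of df with a product column for every
    pair of columns whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    out = df.copy()
    for a, b in combinations(df.columns, 2):
        if corr.loc[a, b] > threshold:
            out[f"{a}*{b}"] = df[a] * df[b]
    return out


# Synthetic data (illustrative only): x2 tracks x1 closely, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
features = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
y = x1 + x2 + x3 + 2.0 * x1 * x2

# Only the (x1, x2) pair passes the threshold, so only x1*x2 is added.
X_sel = add_correlated_interactions(features, threshold=0.8)
model = LinearRegression().fit(X_sel, y)
```

Here threshold plays the role of j from the question. Whether a high correlation between two predictors is a good reason to add their interaction is a separate modelling decision; the sketch only shows the mechanics.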

harsha answered Sep 28 '22 16:09


Use patsy to construct a design matrix as follows:

    from patsy import dmatrices

    y, X = dmatrices('y ~ x1 + x2 + x3 + x1:x2', your_data)

Here your_data is e.g. a DataFrame with response column y and input columns x1, x2 and x3. In the formula, x1:x2 denotes the interaction (product) of x1 and x2.

Then just call the fit method of your estimator, e.g. LinearRegression().fit(X, y).
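A runnable sketch of the patsy route, on made-up data. Note that patsy adds its own Intercept column to X by default, so this sketch passes fit_intercept=False to avoid fitting the constant twice; that choice is mine, not part of the original answer.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (illustrative only) with a true x1*x2 interaction.
rng = np.random.default_rng(0)
your_data = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
your_data["y"] = (
    your_data["x1"] + 2.0 * your_data["x2"] - your_data["x3"]
    + 3.0 * your_data["x1"] * your_data["x2"] + 0.5
)

# x1:x2 adds only the product term; patsy also prepends an Intercept column.
y, X = dmatrices("y ~ x1 + x2 + x3 + x1:x2", your_data)
columns = X.design_info.column_names

# y comes back as a 2-D design matrix, so flatten it before fitting.
model = LinearRegression(fit_intercept=False)
model.fit(np.asarray(X), np.asarray(y).ravel())
coefs = dict(zip(columns, model.coef_))
```

The coefficient for the interaction is then under coefs["x1:x2"], and the constant term under coefs["Intercept"].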

DontDivideByZero answered Sep 28 '22 16:09