Scikit Learn PolynomialFeatures - what is the use of the include_bias option?

In scikit-learn's PolynomialFeatures preprocessor, there is an include_bias option. It essentially just adds a column of ones to the feature matrix. I was wondering what the point of it is. Of course, you can set it to False, but theoretically, how does having or not having a column of ones alongside the generated polynomial features affect regression?

This is the explanation in the documentation, but I can't get anything useful out of it regarding why it should or should not be used.

include_bias : boolean

If True (default), then include a bias column, the feature in which all polynomial powers are zero (i.e. a column of ones - acts as an intercept term in a linear model).
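
To make the documentation's wording concrete, here is a minimal sketch of what the option changes in the transformed output. The three toy values of x are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Three samples of a single feature x (toy values, not from the question).
x = np.array([[2.0], [3.0], [4.0]])

# Degree-2 expansion with and without the bias column.
with_bias = PolynomialFeatures(degree=2, include_bias=True).fit_transform(x)
without_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)

print(with_bias)     # columns: 1, x, x^2
print(without_bias)  # columns: x, x^2
```

The only difference between the two arrays is the leading column of ones.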

Anup Sebastian asked Dec 10 '22 01:12

1 Answer

Suppose you want to perform the following regression:

y ~ a + b x + c x^2

where x is a generic sample. The best coefficients a, b, c are computed with simple matrix calculus. First, let X = [1 | x | x^2] denote a matrix with N rows, where N is the number of samples. The first column is a column of 1s, the second column holds the values x_i for all samples i, and the third column holds the values x_i^2 for all samples i. Let B = [a b c]^T denote the column vector of coefficients. If y is the column vector of the N target values for all samples i, we can write the regression as

y ~ X B

The i-th row of this equation is y_i ~ [1 x_i x_i^2] [a b c]^T = a + b x_i + c x_i^2.

The goal of training a regression is to find B = [a b c]^T such that X B is as close as possible to y.

If you don't add the column of 1s, you are assuming a priori that a = 0, which might not be correct.
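
The effect of dropping the column of 1s can be sketched with plain NumPy least squares. The data below are synthetic and the true coefficients (a = 5, b = 2, c = 3) are assumptions made up for the illustration.

```python
import numpy as np

# Synthetic, noise-free data from y = 5 + 2x + 3x^2 (illustrative coefficients).
x = np.linspace(-2.0, 2.0, 20)
y = 5.0 + 2.0 * x + 3.0 * x**2

# Design matrix with the column of 1s: [1 | x | x^2].
X_with_ones = np.column_stack([np.ones_like(x), x, x**2])
# Design matrix without it: [x | x^2], which forces a = 0.
X_no_ones = np.column_stack([x, x**2])

# Least-squares solution of X B ~ y for each design matrix.
coef_with, *_ = np.linalg.lstsq(X_with_ones, y, rcond=None)
coef_without, *_ = np.linalg.lstsq(X_no_ones, y, rcond=None)

# Largest absolute error of each fit on the training samples.
err_with = np.max(np.abs(X_with_ones @ coef_with - y))
err_without = np.max(np.abs(X_no_ones @ coef_without - y))

print(coef_with)              # approximately [5, 2, 3]
print(err_with, err_without)  # the second error is much larger
```

Without the column of 1s the fitted curve must pass through the origin, so it cannot reproduce the offset of 5 and the error stays large no matter how b and c are chosen.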

In practice, when you use PolynomialFeatures together with sklearn.linear_model.LinearRegression in Python, the latter fits the intercept itself by default (its fit_intercept parameter is True by default), so you don't need to add the column of 1s in PolynomialFeatures as well. Therefore, one usually sets include_bias=False in PolynomialFeatures.
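
As a rough check of this, the sketch below fits the same synthetic data in two ways: once letting LinearRegression handle the intercept, and once supplying the column of 1s through PolynomialFeatures and turning fit_intercept off. The data, noise level, and random seed are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic noisy data from y = 5 + 2x + 3x^2 (illustrative coefficients).
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=(50, 1))
y = 5.0 + 2.0 * x[:, 0] + 3.0 * x[:, 0] ** 2 + rng.normal(scale=0.1, size=50)

# Option A: no bias column; LinearRegression fits the intercept itself.
X_a = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model_a = LinearRegression(fit_intercept=True).fit(X_a, y)

# Option B: bias column present; LinearRegression is told not to fit an intercept.
X_b = PolynomialFeatures(degree=2, include_bias=True).fit_transform(x)
model_b = LinearRegression(fit_intercept=False).fit(X_b, y)

print(model_a.intercept_, model_a.coef_)  # a, then [b, c]
print(model_b.coef_)                      # [a, b, c]
```

Both options should recover the same a, b, c up to numerical precision; the intercept simply shows up in intercept_ in option A and as the first entry of coef_ in option B.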

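The situation is different if you use the OLS class from statsmodels instead of LinearRegression: OLS does not add an intercept by default, so the column of 1s has to be supplied explicitly, for example by keeping include_bias=True.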

Andrea Araldo answered Dec 28 '22 10:12