 

How to do linear regression using Python and scikit-learn with one-hot encoding?

I am trying to use linear regression in combination with Python and scikit-learn to answer the question "can user session lengths be predicted given user demographic information?"

I am using linear regression because the user session lengths are in milliseconds, which is continuous. I one-hot encoded all of my categorical variables, including gender, country, and age range.

I am not sure how to take my one-hot encoding into account, or whether I even need to.

Input data:

[screenshot of the one-hot encoded input table; image not available]

I tried reading here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

I understand that the main inputs are whether to calculate a fit intercept, whether to normalize, whether to copy X (all boolean), and the number of jobs (n_jobs).

I'm not sure what factors to take into account when deciding on these inputs. I'm also concerned about whether my one-hot encoding of the variables has an impact.
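For reference, this is roughly the call I am looking at (a sketch based on the linked docs; note that the normalize parameter has since been deprecated and removed in scikit-learn 1.2, so I leave it out):

from sklearn.linear_model import LinearRegression

# defaults per the linked docs; `normalize` was removed in scikit-learn 1.2
model = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=1)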

asked Dec 25 '16 by pr338

2 Answers

You can do it like this:

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

# X is a numpy array with your (categorical) features
# y is the label array
enc = OneHotEncoder(sparse=False)  # on scikit-learn >= 1.2 use sparse_output=False instead
X_transform = enc.fit_transform(X)

# apply your linear regression as you want
model = LinearRegression()
model.fit(X_transform, y)

print("Mean squared error: %.2f" % np.mean((model.predict(X_transform) - y) ** 2))

Please note that in this example I am training and testing on the same dataset! This may cause your model to overfit. You should avoid that by splitting the data into train and test sets or by doing cross-validation, as sketched below.
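For instance, a minimal sketch of a train/test split (reusing X_transform and y from above; the 80/20 split and random_state are just for illustration):

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X_transform, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
print("Test MSE: %.2f" % mean_squared_error(y_test, model.predict(X_test)))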

answered Oct 23 '22 by silviomoreto


I just wanted to fit a linear regression with sklearn, which I use as a benchmark for other non-linear approaches, such as MLPRegressor, but also for variations of linear regression, such as Ridge, Lasso and ElasticNet (see here for an introduction to this family: http://scikit-learn.org/stable/modules/linear_model.html).

Doing it the same way as described by @silviomoreto (which worked for all other models) actually resulted in an erroneous model for me (very high errors). This is most likely due to the so-called dummy variable trap, which occurs due to multicollinearity in the variables when you include one dummy variable per category for categorical variables -- which is exactly what OneHotEncoder does! See also the following discussion on statsexchange: https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn.
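To see the trap in a toy example (my sketch, not part of the linked discussion): with a full set of dummies, the dummy columns for a feature sum to the all-ones intercept column, so the design matrix is perfectly collinear:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# one categorical feature with three levels
X = np.array([['a'], ['b'], ['c'], ['a']])
dummies = OneHotEncoder(sparse_output=False).fit_transform(X)  # sparse=False on scikit-learn < 1.2

# every row sums to 1, i.e. the dummy columns add up to the intercept
# column of ones -> perfect multicollinearity
print(dummies.sum(axis=1))  # [1. 1. 1. 1.]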

To avoid this, I wrote a simple wrapper that excludes one dummy column, which then acts as the default category.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder


class DummyEncoder(BaseEstimator, TransformerMixin):

    def __init__(self, n_values='auto'):
        self.n_values = n_values

    def transform(self, X):
        # note: n_values was removed in scikit-learn 0.22; on newer versions
        # OneHotEncoder(drop='first') achieves the same goal directly
        ohe = OneHotEncoder(sparse=False, n_values=self.n_values)
        return ohe.fit_transform(X)[:, :-1]  # drop the last dummy column

    def fit(self, X, y=None, **fit_params):
        return self

So building on the code of @silviomoreto, you would change the line that creates the encoder:

enc = DummyEncoder()

This solved the problem for me. Note that OneHotEncoder worked fine (and better) for all other models, such as Ridge, Lasso and ANN.

I chose this approach because I wanted to include it in my feature pipeline. But you seem to have your data encoded already. In that case, you would have to drop one column per category (e.g. for male/female, only include one of them). So if you used, for example, pandas.get_dummies(...), this can be done with the parameter drop_first=True.
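For example (a sketch with made-up values for the demographic columns from the question):

import pandas as pd

# hypothetical demographic frame mirroring the question's features
df = pd.DataFrame({
    'gender': ['male', 'female', 'female'],
    'country': ['US', 'DE', 'US'],
    'age_range': ['18-25', '26-35', '18-25'],
})

# drop_first=True keeps k-1 dummies per category, avoiding the trap
X = pd.get_dummies(df, drop_first=True)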

Last but not least, if you really need to go deeper into linear regression in Python, and not just use it as a benchmark, I would recommend statsmodels over scikit-learn (https://pypi.python.org/pypi/statsmodels), as it provides better model statistics, e.g. p-values per variable, etc.
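A minimal sketch (assuming an already-encoded feature matrix X_enc and target y; both names are placeholders):

import statsmodels.api as sm

# statsmodels does not add an intercept automatically
X_design = sm.add_constant(X_enc)
result = sm.OLS(y, X_design).fit()

# the summary reports coefficients, p-values, R^2, confidence intervals, ...
print(result.summary())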

answered Oct 23 '22 by Marcus V.