Using categorical data as features in sklearn LogisticRegression

I'm trying to understand how to use categorical data as features in sklearn.linear_model's LogisticRegression.

I understand, of course, that I need to encode it.

  1. What I don't understand is how to pass the encoded feature to LogisticRegression so that it is treated as a categorical feature, rather than having the integer value it received during encoding interpreted as an ordinary quantitative feature.

  2. (Less important) Can somebody explain the difference between using preprocessing.LabelEncoder(), DictVectorizer.vocabulary, and simply encoding the categorical data yourself with a plain dict? Alex A.'s comment here touches on the subject, but not very deeply.

I'm especially interested in the first one!

asked Nov 28 '15 by Optimesh


1 Answer

Suppose the type of each categorical variable is "object". First, you can create a pandas Index of the categorical column names:

import pandas as pd

# columns with dtype 'object' are treated as categorical
catColumns = df.select_dtypes(['object']).columns

Then, you can create the indicator variables with the for-loop below. For binary categorical variables, use LabelEncoder() to convert them to 0 and 1. For categorical variables with more than two categories, use pd.get_dummies() to obtain the indicator variables and then drop one category, to avoid the multicollinearity (dummy-variable trap) issue.

from sklearn import preprocessing

le = preprocessing.LabelEncoder()

for col in catColumns:
    n = df[col].nunique()
    if n > 2:
        # More than two categories: one-hot encode, prefixing the dummy
        # names with the column name to avoid clashes between columns,
        # and drop the first dummy to avoid perfect multicollinearity.
        X = pd.get_dummies(df[col], prefix=col)
        X = X.drop(X.columns[0], axis=1)
        df[X.columns] = X
        df.drop(col, axis=1, inplace=True)  # drop the original categorical column
    else:
        # Binary category: map the two labels to 0/1 in place.
        df[col] = le.fit_transform(df[col])
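
Once the loop has run, every column in df is numeric, so the frame can go straight into LogisticRegression. A minimal sketch of that final step, assuming the DataFrame also holds a target column named 'y' (the name is hypothetical; substitute your own label column):

from sklearn.linear_model import LogisticRegression

y = df['y']               # hypothetical target column
X = df.drop('y', axis=1)  # everything else is now numeric

clf = LogisticRegression()
clf.fit(X, y)
print(clf.score(X, y))    # training accuracy, as a quick sanity check

On newer versions of pandas you can also pass drop_first=True to pd.get_dummies() to drop the reference category in one step.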
answered Nov 08 '22 by Yongkai