Dummify categorical variables for logistic regression with pandas and scikit (OneHotEncoder)

I read a blog post about new features in scikit-learn. The OneHotEncoder accepting strings seems like a useful feature. Below is my attempt to use it:

import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

cols = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

train_df = pd.read_csv('../../data/train.csv', usecols=cols)
test_df = pd.read_csv('../../data/test.csv', usecols=[e for e in cols if e != 'Survived'])

train_df.dropna(inplace=True)
test_df.dropna(inplace=True)

X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.copy()

ct = ColumnTransformer([("onehot", OneHotEncoder(sparse=False), ['Sex', 'Embarked'])], remainder='passthrough')

X_train_t = ct.fit_transform(train_df)
X_test_t  = ct.fit_transform(test_df)

print(X_train_t[0])
print(X_test_t[0])

# [ 0.    1.    0.    0.    1.    0.    3.   22.    1.    0.    7.25]
# [ 0.    1.    0.    1.    0.    3.   34.5    0.    0.    7.8292]

logreg = LogisticRegression(max_iter=5000)
logreg.fit(X_train_t, Y_train)
Y_pred = logreg.predict(X_test_t) # ValueError: X has 10 features per sample; expecting 11
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)

print(acc_log)

Running this code raises the Python error below, and I also have some additional concerns.

ValueError: X has 10 features per sample; expecting 11

To start from the beginning: this script is written for the "titanic" dataset from Kaggle. We have five numerical columns: Pclass, Age, SibSp, Parch and Fare. The columns Sex and Embarked are categorical, with the options male/female and Q/S/C (each an abbreviation for a port city).

What I understood from the OneHotEncoder is that it creates dummy variables by adding extra columns. Note that the output of ct.fit_transform() is no longer a pandas DataFrame but a NumPy array. As the print statements above show, there are now more than the original 7 columns.
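
For example, here is a toy sketch of that behaviour (made-up data, separate from the Titanic files):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse=False)
demo = pd.DataFrame({'Sex': ['male', 'female', 'female']})
print(enc.fit_transform(demo[['Sex']]))
# [[0. 1.]
#  [1. 0.]
#  [1. 0.]]  -> one 0/1 column per category (female, male)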

There are three problems I encounter:

  1. For some reason the test.csv has one less column. That would indicate to me that one of the categories has one less option. To fix that I would have to find all the available options in the categories over both the train and test data, and then use these options (such as male/female) to transform the train and the test data separately. I have no idea how to do this with the tools I'm working with (pandas, scikit-learn, etc.); see my rough sketch after this list. On second thought, after inspecting the data I cannot actually find the missing option in the test.csv.

  2. I want to avoid the "dummy variable trap". Right now it seems that too many columns are created. I was expecting 1 column for Sex (2 options minus 1 to avoid the trap) and 2 for Embarked. With the additional 5 numerical columns that would come to 8 columns in total.

  3. I no longer recognize the output of the transform. I would prefer a new dataframe where the new dummy columns are given their own names, such as Sex_male (1/0), Embarked_Q (1/0) and Embarked_S (1/0).
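
For problem 1, here is the rough sketch I have in mind (continuing from the code above, and assuming I read the docs right that OneHotEncoder's categories parameter accepts explicit lists of options per column):

# collect the options from both files so train and test get the same columns
all_sex      = sorted(set(train_df['Sex']) | set(test_df['Sex']))
all_embarked = sorted(set(train_df['Embarked']) | set(test_df['Embarked']))
enc = OneHotEncoder(categories=[all_sex, all_embarked], sparse=False)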

I'm only used to gretl, where dummifying a variable and leaving out one option is very natural. I don't know whether I'm doing it wrong in Python or whether this scenario is simply not part of the standard scikit-learn toolkit. Any advice? Maybe I should write a custom encoder for this?

Asked Dec 25 '19 by Flip


1 Answer

I will try and answer all your questions individually.

Answer for Question 1

In your code you have used the fit_transform method on both your train and test data, which is not the correct way of doing it. Generally, fit_transform is applied only to your train data set: it fits the encoder (learning the available options/levels of each categorical variable) and returns the transformed data. The fitted transformer is then reused to transform your test data set. When you call fit_transform on your test data instead, you refit the encoder using only the options/levels present in the test set. It is quite possible that your test data does not contain all options/levels of all categorical variables, in which case the dimensions of your train and test data will differ, resulting in the error you got.

So the correct way of doing it would be (note that we also fit on X_train rather than train_df, so that the Survived target is not passed through as a feature):

X_train_t = ct.fit_transform(X_train)
X_test_t  = ct.transform(X_test)
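
To see why this matters, here is a minimal sketch with made-up port labels: refitting on data that happens to contain fewer levels produces fewer columns.

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse=False)
enc.fit([['C'], ['Q'], ['S']])                   # "train": all three levels seen
print(enc.transform([['C'], ['Q']]).shape)       # (2, 3) - columns match training
print(enc.fit_transform([['C'], ['Q']]).shape)   # (2, 2) - refitting forgets 'S'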

Answer for Question 2

If you want to avoid the "dummy variable trap", you can set the drop parameter to "first" when creating the OneHotEncoder object inside the ColumnTransformer. This results in just one column for Sex and two columns for Embarked, since they have two and three options/levels respectively.

So the correct way of doing it would be:

ct = ColumnTransformer([("onehot", OneHotEncoder(sparse=False, drop="first"), ['Sex','Embarked'])], remainder='passthrough')
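
As a quick check on made-up rows, a two-level feature now yields one column and a three-level feature two:

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse=False, drop='first')
demo = [['male', 'S'], ['female', 'C'], ['female', 'Q']]
print(enc.fit_transform(demo).shape)  # (3, 3): 1 column for Sex + 2 for Embarked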

Answer for Question 3

As of now, the get_feature_names method (which would let you reconstruct your data frame with the new dummy column names) is not supported in sklearn on a ColumnTransformer that uses remainder='passthrough'. One workaround is to change the remainder to drop in the ColumnTransformer construction and assemble the data frame yourself, as shown below:

ct = ColumnTransformer([("onehot", OneHotEncoder(sparse=False, drop="first"), ['Sex', 'Embarked'])], remainder='drop')
X_train_t = ct.fit_transform(X_train)  # refit with the new transformer
# reset the index so the two parts line up row by row after dropna
A = pd.concat([X_train.drop(["Sex", "Embarked"], axis=1).reset_index(drop=True), pd.DataFrame(X_train_t, columns=ct.get_feature_names())], axis=1)
A.head()

which will result in a data frame whose columns are the remaining numeric columns followed by the new dummy columns, named by get_feature_names (e.g. onehot__x0_male, onehot__x1_Q, onehot__x1_S).

Your final code will look like this:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

cols = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

train_df = pd.read_csv('train.csv', usecols=cols)
test_df = pd.read_csv('test.csv', usecols=[e for e in cols if e != 'Survived'])

train_df = train_df.dropna()
test_df = test_df.dropna()

train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.copy()

categorical_values = ['Sex', 'Embarked']

# keep the numeric columns aside; the transformer handles the categorical ones
X_train_cont = X_train.drop(categorical_values, axis=1)
X_test_cont = X_test.drop(categorical_values, axis=1)

ct = ColumnTransformer([("onehot", OneHotEncoder(sparse=False, drop="first"), categorical_values)], remainder='drop')

# fit on train only, then reuse the fitted transformer on the test set
X_train_categorical = ct.fit_transform(X_train)
X_test_categorical  = ct.transform(X_test)

# glue the numeric columns back onto the named dummy columns
X_train_t = pd.concat([X_train_cont, pd.DataFrame(X_train_categorical, columns=ct.get_feature_names())], axis=1)
X_test_t = pd.concat([X_test_cont, pd.DataFrame(X_test_categorical, columns=ct.get_feature_names())], axis=1)

logreg = LogisticRegression(max_iter=5000)
logreg.fit(X_train_t, Y_train)
Y_pred = logreg.predict(X_test_t)

acc_log = round(logreg.score(X_train_t, Y_train) * 100, 2)

print(acc_log)

80.34

And when you do X_train_t.head() you get a data frame with the numeric columns (Pclass, Age, SibSp, Parch, Fare) followed by the named dummy columns.
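
A side note: on newer scikit-learn releases two of the names used above have changed, so the same construction would look roughly like this (a sketch; the exact version cutoffs are approximate):

# `sparse` was renamed `sparse_output` and `get_feature_names` was replaced
# by `get_feature_names_out` in newer scikit-learn (around 1.0-1.2)
ct = ColumnTransformer([("onehot", OneHotEncoder(sparse_output=False, drop="first"), categorical_values)], remainder='drop')
X_train_t = pd.concat([X_train_cont, pd.DataFrame(ct.fit_transform(X_train), columns=ct.get_feature_names_out())], axis=1)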

Hope this helps!

Answered Sep 28 '22 by Parthasarathy Subburaj