Dummify categorical variables for logistic regression with pandas and scikit (OneHotEncoder)

I read a blog post about new features in scikit-learn. The OneHotEncoder accepting strings seems like a useful feature. Below is my attempt to use it:

import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

cols = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

train_df = pd.read_csv('../../data/train.csv', usecols=cols)
test_df = pd.read_csv('../../data/test.csv', usecols=[e for e in cols if e != 'Survived'])

train_df.dropna(inplace=True)
test_df.dropna(inplace=True)

X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.copy()

ct = ColumnTransformer([("onehot", OneHotEncoder(sparse=False), ['Sex', 'Embarked'])], remainder='passthrough')

X_train_t = ct.fit_transform(train_df)
X_test_t  = ct.fit_transform(test_df)

print(X_train_t[0])
print(X_test_t[0])

# [ 0.    1.    0.    0.    1.    0.    3.   22.    1.    0.    7.25]
# [ 0.    1.    0.    1.    0.    3.   34.5    0.    0.    7.8292]

logreg = LogisticRegression(max_iter=5000)
logreg.fit(X_train_t, Y_train)
Y_pred = logreg.predict(X_test_t) # ValueError: X has 10 features per sample; expecting 11
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)

print(acc_log)

Running this code raises the Python error below, and I also have some additional concerns.

ValueError: X has 10 features per sample; expecting 11

To start from the beginning: this script is written for the "titanic" dataset from Kaggle. We have five numerical columns: Pclass, Age, SibSp, Parch and Fare. The columns Sex and Embarked are categorical, with the options male/female and Q/S/C (each an abbreviation for a port city).

What I understood from the OneHotEncoder is that it creates dummy variables by adding extra columns. Note that the output of ct.fit_transform() is no longer a pandas DataFrame but a NumPy array. As the print statements above show, there are now more than the original 7 columns.
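
For example, here is a toy sketch of that behaviour (made-up data, separate from the Titanic files):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse=False)
demo = pd.DataFrame({'Sex': ['male', 'female', 'female']})
print(enc.fit_transform(demo[['Sex']]))
# [[0. 1.]
#  [1. 0.]
#  [1. 0.]]  -> one 0/1 column per category (female, male)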

There are three problems I encounter:

  1. For some reason the test.csv has one less column. That would indicate to me that one of the categories has one less option. To fix that I would have to find all the available options in the categories over both the train and test data, and then use these options (such as male/female) to transform the train and the test data separately. I have no idea how to do this with the tools I'm working with (pandas, scikit-learn, etc.); see my rough sketch after this list. On second thought, after inspecting the data I cannot actually find the missing option in the test.csv.

  2. I want to avoid the "dummy variable trap". Right now it seems that too many columns are created. I was expecting 1 column for Sex (2 options minus 1 to avoid the trap) and 2 for Embarked. With the additional 5 numerical columns that would come to 8 columns in total.

  3. I no longer recognize the output of the transform. I would prefer a new dataframe where the new dummy columns are given their own names, such as Sex_male (1/0), Embarked_Q (1/0) and Embarked_S (1/0).
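
For problem 1, here is the rough sketch I have in mind (continuing from the code above, and assuming I read the docs right that OneHotEncoder's categories parameter accepts explicit lists of options per column):

# collect the options from both files so train and test get the same columns
all_sex      = sorted(set(train_df['Sex']) | set(test_df['Sex']))
all_embarked = sorted(set(train_df['Embarked']) | set(test_df['Embarked']))
enc = OneHotEncoder(categories=[all_sex, all_embarked], sparse=False)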

I'm only used to gretl, where dummifying a variable and leaving out one option is very natural. I don't know whether I'm doing it wrong in Python or whether this scenario is simply not part of the standard scikit-learn toolkit. Any advice? Maybe I should write a custom encoder for this?

Asked Dec 25 '19 by Flip


1 Answer

I will try and answer all your questions individually.

Answer for Question 1

In your code you have used the fit_transform method on both your train and test data, which is not the correct way of doing it. Generally, fit_transform is applied only to your train data set: it fits the encoder (learning the available options/levels of each categorical variable) and returns the transformed data. The fitted transformer is then reused to transform your test data set. When you call fit_transform on your test data instead, you refit the encoder using only the options/levels present in the test set. It is quite possible that your test data does not contain all options/levels of all categorical variables, in which case the dimensions of your train and test data will differ, resulting in the error you got.

So the correct way of doing it would be (note that we also fit on X_train rather than train_df, so that the Survived target is not passed through as a feature):

X_train_t = ct.fit_transform(X_train)
X_test_t  = ct.transform(X_test)
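
To see why this matters, here is a minimal sketch with made-up port labels: refitting on data that happens to contain fewer levels produces fewer columns.

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse=False)
enc.fit([['C'], ['Q'], ['S']])                   # "train": all three levels seen
print(enc.transform([['C'], ['Q']]).shape)       # (2, 3) - columns match training
print(enc.fit_transform([['C'], ['Q']]).shape)   # (2, 2) - refitting forgets 'S'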

Answer for Question 2

If you want to avoid the "dummy variable trap", you can set the drop parameter to "first" when creating the OneHotEncoder object inside the ColumnTransformer. This results in just one column for Sex and two columns for Embarked, since they have two and three options/levels respectively.

So the correct way of doing it would be:

ct = ColumnTransformer([("onehot", OneHotEncoder(sparse=False, drop="first"), ['Sex','Embarked'])], remainder='passthrough')
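
As a quick check on made-up rows, a two-level feature now yields one column and a three-level feature two:

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse=False, drop='first')
demo = [['male', 'S'], ['female', 'C'], ['female', 'Q']]
print(enc.fit_transform(demo).shape)  # (3, 3): 1 column for Sex + 2 for Embarked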

Answer for Question 3

As of now, the get_feature_names method (which would let you reconstruct your data frame with the new dummy column names) is not supported in sklearn on a ColumnTransformer that uses remainder='passthrough'. One workaround is to change the remainder to drop in the ColumnTransformer construction and assemble the data frame yourself, as shown below:

ct = ColumnTransformer([("onehot", OneHotEncoder(sparse=False, drop="first"), ['Sex', 'Embarked'])], remainder='drop')
X_train_t = ct.fit_transform(X_train)  # refit with the new transformer
# reset the index so the two parts line up row by row after dropna
A = pd.concat([X_train.drop(["Sex", "Embarked"], axis=1).reset_index(drop=True), pd.DataFrame(X_train_t, columns=ct.get_feature_names())], axis=1)
A.head()

which will result in a data frame whose columns are the remaining numeric columns followed by the new dummy columns, named by get_feature_names (e.g. onehot__x0_male, onehot__x1_Q, onehot__x1_S).

Your final code will look like this:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

cols = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

train_df = pd.read_csv('train.csv', usecols=cols)
test_df = pd.read_csv('test.csv', usecols=[e for e in cols if e != 'Survived'])

train_df = train_df.dropna()
test_df = test_df.dropna()

train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.copy()

categorical_values = ['Sex', 'Embarked']

# keep the numeric columns aside; the transformer handles the categorical ones
X_train_cont = X_train.drop(categorical_values, axis=1)
X_test_cont = X_test.drop(categorical_values, axis=1)

ct = ColumnTransformer([("onehot", OneHotEncoder(sparse=False, drop="first"), categorical_values)], remainder='drop')

# fit on train only, then reuse the fitted transformer on the test set
X_train_categorical = ct.fit_transform(X_train)
X_test_categorical  = ct.transform(X_test)

# glue the numeric columns back onto the named dummy columns
X_train_t = pd.concat([X_train_cont, pd.DataFrame(X_train_categorical, columns=ct.get_feature_names())], axis=1)
X_test_t = pd.concat([X_test_cont, pd.DataFrame(X_test_categorical, columns=ct.get_feature_names())], axis=1)

logreg = LogisticRegression(max_iter=5000)
logreg.fit(X_train_t, Y_train)
Y_pred = logreg.predict(X_test_t)

acc_log = round(logreg.score(X_train_t, Y_train) * 100, 2)

print(acc_log)

80.34

And when you do X_train_t.head() you get a data frame with the numeric columns (Pclass, Age, SibSp, Parch, Fare) followed by the named dummy columns.
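
A side note: on newer scikit-learn releases two of the names used above have changed, so the same construction would look roughly like this (a sketch; the exact version cutoffs are approximate):

# `sparse` was renamed `sparse_output` and `get_feature_names` was replaced
# by `get_feature_names_out` in newer scikit-learn (around 1.0-1.2)
ct = ColumnTransformer([("onehot", OneHotEncoder(sparse_output=False, drop="first"), categorical_values)], remainder='drop')
X_train_t = pd.concat([X_train_cont, pd.DataFrame(ct.fit_transform(X_train), columns=ct.get_feature_names_out())], axis=1)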

Hope this helps!

Answered Sep 28 '22 by Parthasarathy Subburaj