Possible ways to do one hot encoding in scikit-learn?

Question

I have a pandas data frame with some categorical columns. Some of these contain non-integer values.

I want to apply several machine learning models to this data. Some models need preprocessing to give better results, for example converting categorical variables into dummy/indicator variables. pandas has a function called get_dummies for that purpose. However, its output depends on the data it is given. If I call get_dummies on the training data and then again on the test data, the resulting columns can differ, because a categorical column in the test data may contain only a subset of, or a different set of, the values seen in the training data.
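A minimal sketch of the mismatch described above, using made-up toy data (the column name color and its values are assumptions for illustration, not from my real data set):

```python
import pandas as pd

# Hypothetical toy data: the test set lacks "green" and contains an unseen "blue".
train_df = pd.DataFrame({"color": ["red", "green", "red"]})
test_df = pd.DataFrame({"color": ["red", "blue"]})

# Each call derives its columns only from the values present in its own input.
train_dummies = pd.get_dummies(train_df["color"])
test_dummies = pd.get_dummies(test_df["color"])

train_columns = list(train_dummies.columns)
test_columns = list(test_dummies.columns)

print(train_columns)  # ['green', 'red']
print(test_columns)   # ['blue', 'red']
```

The two frames have the same number of columns here, but the columns mean different things, so a model fitted on the first cannot safely be applied to the second.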

Therefore, I am looking for other methods to do one-hot encoding.

What are possible ways to do one-hot encoding in Python (pandas/sklearn)?

David Maust · Accepted Answer

Scikit-learn provides an encoder sklearn.preprocessing.LabelBinarizer.

To encode the training data, use fit_transform, which discovers the category labels and creates the corresponding dummy variables.

import sklearn.preprocessing
# Learn the category labels from the training column and encode it.
label_binarizer = sklearn.preprocessing.LabelBinarizer()
training_mat = label_binarizer.fit_transform(df.Label)

For the test data, call transform on the already fitted encoder so that the same set of categories is reused.

# Reuse the categories learned from the training data; do not refit.
test_mat = label_binarizer.transform(test_df.Label)
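Put together, the two steps might look like the following self-contained sketch. The data frames and their Label values are invented for illustration; only LabelBinarizer, fit_transform, transform and the classes_ attribute come from scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy frames; the test set contains only some of the training labels.
df = pd.DataFrame({"Label": ["red", "green", "blue", "green"]})
test_df = pd.DataFrame({"Label": ["green", "red"]})

label_binarizer = LabelBinarizer()

# Fit on the training column: the categories are learned here, once.
training_mat = label_binarizer.fit_transform(df.Label)

# Transform the test column with the categories learned above.
test_mat = label_binarizer.transform(test_df.Label)

# classes_ holds the learned labels in sorted order, one per output column.
classes = list(label_binarizer.classes_)

print(classes)             # ['blue', 'green', 'red']
print(training_mat.shape)  # (4, 3)
print(test_mat.tolist())   # [[0, 1, 0], [0, 0, 1]]
```

Both matrices have one column per label seen during fitting, in the same order, even though "blue" never appears in the test data. One caveat worth checking for your own data: when the fitted column has exactly two distinct labels, LabelBinarizer returns a single 0/1 column rather than two.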

Tags:

python

pandas

scikit-learn

Nguyen Ngoc Tuan
