 

One hot encoding of string categorical features

I'm trying to perform a one hot encoding of a trivial dataset.

data = [['a', 'dog', 'red'],
        ['b', 'cat', 'green']]

What's the best way to preprocess this data using Scikit-Learn?

On first instinct, you'd look towards Scikit-Learn's OneHotEncoder. But the one hot encoder doesn't support strings as features; it only works on integer values.

So then you would use a LabelEncoder, which encodes the strings into integers. But then you have to apply a label encoder to each of the columns and store every one of these label encoders (as well as the columns they were applied on). And this feels extremely clunky.
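For reference, a minimal sketch of that two-step approach (illustrative variable names, not a recommended pattern):

# One LabelEncoder per column; every fitted encoder must be kept
# around so the same mapping can be applied to new data later.
import numpy as np
from sklearn.preprocessing import LabelEncoder

data = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])

encoders = {}
encoded = np.empty(data.shape, dtype=int)
for col in range(data.shape[1]):
    le = LabelEncoder()
    encoded[:, col] = le.fit_transform(data[:, col])
    encoders[col] = le  # needed again at prediction time

# 'encoded' is now all-integer and could be fed to the old OneHotEncoder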

So, what's the best way to do it in Scikit-Learn?

Please don't suggest pandas.get_dummies. That's what I generally use nowadays for one hot encodings. However, it's limited by the fact that you can't encode your training and test sets separately.
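A small illustration of that limitation (hypothetical frames, chosen to mirror the data above): get_dummies derives its columns from whatever values each frame happens to contain, so separately encoded frames fall out of alignment.

import pandas as pd

train = pd.DataFrame({'animal': ['dog', 'cat']})
test = pd.DataFrame({'animal': ['dog']})

pd.get_dummies(train).columns.tolist()  # ['animal_cat', 'animal_dog']
pd.get_dummies(test).columns.tolist()   # ['animal_dog'] -- columns no longer line up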

asked Jan 30 '16 by hlin117

People also ask

How does one-hot encode categorical features?

One-hot encoding is the process of creating dummy variables. The technique is used for categorical features that are nominal (have no intrinsic order). In one hot encoding, a new binary variable is created for every level of the categorical feature.

How do you encode categorical features?

In this encoding scheme, the categorical feature is first converted to numbers with an ordinal encoder. The numbers are then transformed into their binary representation, and the binary digits are split into separate columns. Binary encoding works well when a feature has a high number of categories, as the small illustration below shows.
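A hand-rolled illustration of those three steps (plain Python, not a library API; the category values are made up):

categories = ['red', 'green', 'blue']                    # nominal feature values
ordinal = {c: i + 1 for i, c in enumerate(categories)}   # step 1: ordinal encode

def to_binary_columns(value, width=2):
    """Steps 2 and 3: binary representation, split across columns."""
    bits = format(ordinal[value], f'0{width}b')
    return [int(b) for b in bits]

print({c: to_binary_columns(c) for c in categories})
# {'red': [0, 1], 'green': [1, 0], 'blue': [1, 1]}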

What challenges one may face by applying one-hot encoding on a categorical variable of train dataset?

What challenges might you face if you have applied OHE on a categorical variable of the train dataset? A) Not all categories of the categorical variable are present in the test dataset. B) The frequency distribution of categories differs between the train and the test datasets.


1 Answer

If you are on sklearn > 0.20.dev0 (i.e. any release from 0.20 onward), OneHotEncoder supports strings directly:

In [10]: import numpy as np

In [11]: from sklearn.preprocessing import OneHotEncoder
    ...: cat = OneHotEncoder()
    ...: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T
    ...: cat.fit_transform(X).toarray()
Out[11]:
array([[1., 0., 0., 1., 0.],
       [0., 1., 0., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 0., 1.]])
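This also answers the train/test complaint in the question: fit the encoder on the training data only, and with handle_unknown='ignore' any category that first appears at transform time encodes to all zeros instead of raising. A sketch, reusing X from above as the "training" data:

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X)  # fit on training data only

X_test = np.array([['a', 0], ['d', 1]], dtype=object)  # 'd' was never seen
enc.transform(X_test).toarray()
# array([[1., 0., 0., 1., 0.],
#        [0., 0., 0., 0., 1.]])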

If you are on exactly sklearn == 0.20.dev0, use the short-lived CategoricalEncoder instead (it was folded into OneHotEncoder before the final 0.20 release):

In [29]: from sklearn.preprocessing import CategoricalEncoder

In [30]: cat = CategoricalEncoder()

In [31]: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T

In [32]: cat.fit_transform(X).toarray()
Out[32]:
array([[1., 0., 0., 1., 0.],
       [0., 1., 0., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 0., 1.]])

Another way to do it is to use category_encoders.

Here is an example:

% pip install category_encoders

import category_encoders as ce
import numpy as np

le = ce.OneHotEncoder(return_df=False, impute_missing=False, handle_unknown="ignore")
X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])
le.fit_transform(X)
array([[1, 0, 1, 0, 1, 0],
       [0, 1, 0, 1, 0, 1]])
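Since the fitted encoder remembers its categories, the training and test sets can be encoded separately here as well; with handle_unknown="ignore" (as configured above) unseen values should not raise. A sketch with a hypothetical test row, assuming the same category_encoders version as above:

X_test = np.array([['a', 'fish', 'red']])  # 'fish' did not occur in X
le.transform(X_test)  # same column layout as the training encoding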
answered Sep 18 '22 by zipp