
Issue with OneHotEncoder for categorical features

I want to encode 3 categorical features out of the 10 features in my dataset. I use OneHotEncoder from sklearn.preprocessing to do so, as follows:

    from sklearn import preprocessing

    cat_features = ['color', 'director_name', 'actor_2_name']
    enc = preprocessing.OneHotEncoder(categorical_features=cat_features)
    enc.fit(dataset.values)

However, I couldn't proceed as I am getting this error:

    array = np.array(array, dtype=dtype, order=order, copy=copy)
    ValueError: could not convert string to float: PG

I am surprised that it is complaining about the string, as it is supposed to convert it! Am I missing something here?

Medo asked Apr 24 '17 12:04




2 Answers

If you read the docs for OneHotEncoder, you'll see that the input to fit is an "Input array of type int". The strings therefore have to be converted to integers first, so one-hot encoding your data takes two steps:

    from sklearn import preprocessing

    cat_features = ['color', 'director_name', 'actor_2_name']

    # Step 1: convert the strings to integer labels
    enc = preprocessing.LabelEncoder()
    enc.fit(cat_features)
    new_cat_features = enc.transform(cat_features)
    print(new_cat_features)  # [1 2 0]

    # Step 2: one-hot encode the integers; the encoder expects a 2-D array
    new_cat_features = new_cat_features.reshape(-1, 1)
    # Dense output is easier to read
    # (the argument is named sparse_output from scikit-learn 1.2 onwards)
    ohe = preprocessing.OneHotEncoder(sparse=False)
    print(ohe.fit_transform(new_cat_features))

Output:

    [[ 0.  1.  0.]
     [ 0.  0.  1.]
     [ 1.  0.  0.]]
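The snippet above encodes the list of column names itself, which only demonstrates the mechanics. A minimal sketch of the same two steps applied to the values of a single column might look like the following. The `colors` array is made-up data standing in for the question's `color` column, and `.toarray()` is used instead of the `sparse` argument so that the sketch does not depend on the scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical values standing in for the 'color' column of the dataset
colors = np.array(['Color', 'Black and White', 'Color', 'Color'])

# Step 1: map each distinct string to an integer (classes are sorted)
label_enc = LabelEncoder()
color_ids = label_enc.fit_transform(colors)

# Step 2: one-hot encode the integer codes; the encoder expects 2-D input
ohe = OneHotEncoder()
color_onehot = ohe.fit_transform(color_ids.reshape(-1, 1)).toarray()

print(label_enc.classes_)  # the category behind each integer code
print(color_ids)
print(color_onehot)
```

Each column of `color_onehot` corresponds to one entry of `label_enc.classes_`, so the original strings can still be recovered from the encoded rows.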

EDIT

As of scikit-learn 0.20 this became a bit easier, not only because OneHotEncoder now handles strings directly, but also because we can transform multiple columns easily using ColumnTransformer. See the example below:

    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    X = np.array([['apple', 'red', 1, 'round', 0],
                  ['orange', 'orange', 2, 'round', 0.1],
                  ['banana', 'yellow', 2, 'long', 0],
                  ['apple', 'green', 1, 'round', 0.2]])
    ct = ColumnTransformer(
        # (name, transformer, column numbers I want to apply this to);
        # sparse=False is named sparse_output from scikit-learn 1.2 onwards
        [('oh_enc', OneHotEncoder(sparse=False), [0, 1, 3])],
        remainder='passthrough'  # This leaves the rest of my columns in place
    )
    print(ct.fit_transform(X))  # Notice that the output values are strings

Output:

    [['1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '0.0' '0.0' '1.0' '1' '0']
     ['0.0' '0.0' '1.0' '0.0' '1.0' '0.0' '0.0' '0.0' '1.0' '2' '0.1']
     ['0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1.0' '0.0' '2' '0']
     ['1.0' '0.0' '0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1' '0.2']]
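For a pandas DataFrame like the one in the question, the same idea can be sketched with column names instead of column numbers. This is only an illustration under assumptions: the three categorical column names come from the question, while the `duration` column and all values are invented. `sparse_threshold=0` asks ColumnTransformer to always return a dense array, which avoids the version-dependent `sparse` argument.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of the question's dataset: the categorical column
# names come from the question, 'duration' and all values are made up
dataset = pd.DataFrame({
    'color': ['Color', 'Color', 'Black and White'],
    'director_name': ['A', 'B', 'A'],
    'actor_2_name': ['X', 'Y', 'Y'],
    'duration': [120, 95, 101],
})
cat_features = ['color', 'director_name', 'actor_2_name']

ct = ColumnTransformer(
    # One-hot encode the named columns; categories unseen during fit are
    # encoded as all zeros at transform time instead of raising an error
    [('oh_enc', OneHotEncoder(handle_unknown='ignore'), cat_features)],
    remainder='passthrough',  # keep the numeric columns unchanged
    sparse_threshold=0,       # always return a dense array
)
encoded = ct.fit_transform(dataset)
print(encoded.shape)
print(encoded)
```

Because the passed-through column is numeric here, the result is a numeric array rather than an array of strings.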
piman314 answered Sep 19 '22 17:09


You can apply both transformations (from text categories to integer categories, then from integer categories to one-hot vectors) in one shot using the LabelBinarizer class:

    from sklearn.preprocessing import LabelBinarizer

    cat_features = ['color', 'director_name', 'actor_2_name']
    encoder = LabelBinarizer()
    new_cat_features = encoder.fit_transform(cat_features)
    new_cat_features

Note that this returns a dense NumPy array by default. You can get a sparse matrix instead by passing sparse_output=True to the LabelBinarizer constructor.
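As a rough sketch of how this might be applied to the values of one column rather than to the column names, consider the following. The `ratings` array is made-up data, chosen only because the error message in the question mentions the value PG.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical values standing in for a single categorical column
ratings = np.array(['PG', 'R', 'PG-13', 'PG'])

# Default behaviour: dense NumPy array, one column per class
encoder = LabelBinarizer()
dense = encoder.fit_transform(ratings)
print(encoder.classes_)
print(dense)

# sparse_output=True returns a SciPy sparse matrix with the same content
sparse_encoder = LabelBinarizer(sparse_output=True)
sparse = sparse_encoder.fit_transform(ratings)
print(sparse.toarray())
```

LabelBinarizer works on one column at a time, and with only two distinct classes it returns a single 0/1 column instead of two, so the OneHotEncoder approach is usually the better fit for several feature columns.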

Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow

Fallou Tall answered Sep 18 '22 17:09