I want to encode 3 categorical features out of 10 features in my datasets. I use preprocessing
from sklearn.preprocessing to do so as the following:
from sklearn import preprocessing cat_features = ['color', 'director_name', 'actor_2_name'] enc = preprocessing.OneHotEncoder(categorical_features=cat_features) enc.fit(dataset.values)
However, I couldn't proceed as I am getting this error:
array = np.array(array, dtype=dtype, order=order, copy=copy) ValueError: could not convert string to float: PG
I am surprised why it is complaining about the string as it is supposed to convert it!! Am I missing something here?
OneHotEncoder. Encode categorical integer features using a one-hot aka one-of-K scheme. The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features. The output will be a sparse matrix where each column corresponds to one possible value of one feature.
Challenges of One-Hot Encoding: Dummy Variable Trap Dummy Variable Trap is a scenario in which variables are highly correlated to each other. The Dummy Variable Trap leads to the problem known as multicollinearity. Multicollinearity occurs where there is a dependency between the independent features.
Another disadvantage of one-hot encoding is that it produces multicollinearity among the various variables, lowering the model's accuracy. In addition, you may wish to transform the values back to categorical form so that they may be displayed in your application.
Limitation of label Encoding Label encoding converts the data in machine-readable form, but it assigns a unique number(starting from 0) to each class of data. This may lead to the generation of priority issues in the training of data sets.
If you read the docs for OneHotEncoder
you'll see the input for fit
is "Input array of type int". So you need to do two steps for your one hot encoded data
from sklearn import preprocessing cat_features = ['color', 'director_name', 'actor_2_name'] enc = preprocessing.LabelEncoder() enc.fit(cat_features) new_cat_features = enc.transform(cat_features) print new_cat_features # [1 2 0] new_cat_features = new_cat_features.reshape(-1, 1) # Needs to be the correct shape ohe = preprocessing.OneHotEncoder(sparse=False) #Easier to read print ohe.fit_transform(new_cat_features)
Output:
[[ 0. 1. 0.] [ 0. 0. 1.] [ 1. 0. 0.]]
EDIT
As of 0.20
this became a bit easier, not only because OneHotEncoder
now handles strings nicely, but also because we can transform multiple columns easily using ColumnTransformer
, see below for an example
from sklearn.compose import ColumnTransformer from sklearn.preprocessing import LabelEncoder, OneHotEncoder import numpy as np X = np.array([['apple', 'red', 1, 'round', 0], ['orange', 'orange', 2, 'round', 0.1], ['bannana', 'yellow', 2, 'long', 0], ['apple', 'green', 1, 'round', 0.2]]) ct = ColumnTransformer( [('oh_enc', OneHotEncoder(sparse=False), [0, 1, 3]),], # the column numbers I want to apply this to remainder='passthrough' # This leaves the rest of my columns in place ) print(ct2.fit_transform(X)) # Notice the output is a string
Output:
[['1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '0.0' '0.0' '1.0' '1' '0'] ['0.0' '0.0' '1.0' '0.0' '1.0' '0.0' '0.0' '0.0' '1.0' '2' '0.1'] ['0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1.0' '0.0' '2' '0'] ['1.0' '0.0' '0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1' '0.2']]
You can apply both transformations (from text categories to integer categories, then from integer categories to one-hot vectors) in one shot using the LabelBinarizer class:
cat_features = ['color', 'director_name', 'actor_2_name'] encoder = LabelBinarizer() new_cat_features = encoder.fit_transform(cat_features) new_cat_features
Note that this returns a dense NumPy array by default. You can get a sparse matrix instead by passing sparse_output=True to the LabelBinarizer constructor.
Source Hands-On Machine Learning with Scikit-Learn and TensorFlow
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With