Issue with OneHotEncoder for categorical features

I want to encode 3 categorical features out of the 10 features in my dataset. I use sklearn.preprocessing to do so, as follows:

    from sklearn import preprocessing

    cat_features = ['color', 'director_name', 'actor_2_name']
    enc = preprocessing.OneHotEncoder(categorical_features=cat_features)
    enc.fit(dataset.values)

However, I couldn't proceed as I am getting this error:

    array = np.array(array, dtype=dtype, order=order, copy=copy)
    ValueError: could not convert string to float: PG

I am surprised that it complains about the string, since it is supposed to convert it. Am I missing something here?

asked Apr 24 '17 12:04

Medo

2 Answers

If you read the docs for OneHotEncoder, you'll see that the input for fit is "Input array of type int". So you need two steps to one-hot encode your data: first convert the strings to integer labels with LabelEncoder, then one-hot encode those integers.

    from sklearn import preprocessing

    cat_features = ['color', 'director_name', 'actor_2_name']

    # Step 1: map each string to an integer label
    enc = preprocessing.LabelEncoder()
    enc.fit(cat_features)
    new_cat_features = enc.transform(cat_features)
    print(new_cat_features)  # [1 2 0]

    # Step 2: one-hot encode the integer labels
    new_cat_features = new_cat_features.reshape(-1, 1)  # needs to be a 2-D column
    # sparse=False is easier to read; scikit-learn >= 1.2 names it sparse_output
    ohe = preprocessing.OneHotEncoder(sparse=False)
    print(ohe.fit_transform(new_cat_features))

Output:

    [[ 0.  1.  0.]
     [ 0.  0.  1.]
     [ 1.  0.  0.]]
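The example above encodes the list of feature names itself. As a minimal sketch of the same two steps applied to the values of a single column, assuming a hypothetical `color` column with made-up values (and calling `.toarray()` on the default sparse result instead of passing a `sparse` argument), something like this should work:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical values of one categorical column (made up for illustration)
color = np.array(['Color', 'Black and White', 'Color', 'Color'])

# Step 1: strings -> integer codes (classes are sorted alphabetically)
label_enc = LabelEncoder()
codes = label_enc.fit_transform(color)
print(label_enc.classes_)
print(codes)

# Step 2: integer codes -> one-hot rows, one column per class
onehot_enc = OneHotEncoder()
onehot = onehot_enc.fit_transform(codes.reshape(-1, 1)).toarray()
print(onehot)
```

Each row of `onehot` has a single 1 in the column that corresponds to that row's class in `label_enc.classes_`.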

EDIT

As of scikit-learn 0.20 this became a bit easier, not only because OneHotEncoder now handles strings nicely, but also because we can transform multiple columns easily using ColumnTransformer. See below for an example:

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    import numpy as np

    X = np.array([['apple', 'red', 1, 'round', 0],
                  ['orange', 'orange', 2, 'round', 0.1],
                  ['banana', 'yellow', 2, 'long', 0],
                  ['apple', 'green', 1, 'round', 0.2]])
    ct = ColumnTransformer(
        # sparse=False is named sparse_output in scikit-learn >= 1.2
        [('oh_enc', OneHotEncoder(sparse=False), [0, 1, 3])],  # the column numbers I want to apply this to
        remainder='passthrough'  # this leaves the rest of my columns in place
    )
    print(ct.fit_transform(X))  # notice the output is an array of strings

Output:

    [['1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '0.0' '0.0' '1.0' '1' '0']
     ['0.0' '0.0' '1.0' '0.0' '1.0' '0.0' '0.0' '0.0' '1.0' '2' '0.1']
     ['0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1.0' '0.0' '2' '0']
     ['1.0' '0.0' '0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1' '0.2']]
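For the DataFrame case in the question, a similar sketch might look like the following. The three column names come from the question; the `duration` column and all of the values are made up for illustration, and `sparse_threshold=0` is passed so that the result is always a dense array regardless of the encoder's own sparse setting:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the question's dataset (values are invented)
dataset = pd.DataFrame({
    'color': ['Color', 'Color', 'Black and White'],
    'director_name': ['A', 'B', 'A'],
    'actor_2_name': ['X', 'Y', 'Z'],
    'duration': [120, 95, 110],  # assumed numeric column, left untouched
})

cat_features = ['color', 'director_name', 'actor_2_name']

ct = ColumnTransformer(
    [('oh_enc', OneHotEncoder(), cat_features)],  # select columns by name
    remainder='passthrough',  # keep the non-categorical columns
    sparse_threshold=0,       # always return a dense NumPy array
)
encoded = ct.fit_transform(dataset)
print(encoded.shape)
print(encoded)
```

The one-hot columns come first (2 + 2 + 3 of them for this toy data), followed by the passed-through `duration` column. Selecting columns by name only works when the input is a DataFrame, so pass `dataset` rather than `dataset.values`.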

answered Sep 19 '22 17:09

piman314

You can apply both transformations (from text categories to integer categories, then from integer categories to one-hot vectors) in one shot using the LabelBinarizer class:

    from sklearn.preprocessing import LabelBinarizer

    cat_features = ['color', 'director_name', 'actor_2_name']
    encoder = LabelBinarizer()
    new_cat_features = encoder.fit_transform(cat_features)
    print(new_cat_features)

Note that this returns a dense NumPy array by default. You can get a sparse matrix instead by passing sparse_output=True to the LabelBinarizer constructor.
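As a rough illustration of the dense versus sparse behaviour, here is a small sketch with made-up values for a single column (LabelBinarizer works on one 1-D array of labels at a time):

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical values of one categorical column (made up for illustration)
values = ['red', 'green', 'blue', 'green']

# Default: dense NumPy array, one column per class (classes sorted alphabetically)
dense = LabelBinarizer().fit_transform(values)
print(dense)

# sparse_output=True: a SciPy sparse matrix with the same content
sparse_enc = LabelBinarizer(sparse_output=True)
sparse_result = sparse_enc.fit_transform(values)
print(sparse_enc.classes_)
print(sparse_result.toarray())
```

Keep in mind that with only two distinct classes LabelBinarizer produces a single 0/1 column instead of two one-hot columns.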

Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow

answered Sep 18 '22 17:09

Fallou Tall

Tags: scikit-learn, categorical-data, feature-extraction
