
sklearn mask for onehotencoder does not work

Considering data like:

from sklearn.preprocessing import OneHotEncoder
import numpy as np
dt = 'object, i4, i4'
d = np.array([('aaa', 1, 1), ('bbb', 2, 2)], dtype=dt)  

I want OneHotEncoder to skip the text column and encode only the two integer columns.

Why does the following not work?

ohe = OneHotEncoder(categorical_features=np.array([False,True,True], dtype=bool))       
ohe.fit(d)
ValueError: could not convert string to float: 'bbb'

It says in the documentation:

categorical_features: “all” or array of indices or mask :
  Specify what features are treated as categorical.
   ‘all’ (default): All features are treated as categorical.
   array of indices: Array of categorical feature indices.
   mask: Array of length n_features and with dtype=bool.

I'm using a mask, yet it still tries to convert to float.

Even using

ohe = OneHotEncoder(categorical_features=np.array([False,True,True], dtype=bool), 
                    dtype=dt)        
ohe.fit(d)

Same error.

And also in the case of "array of indices":

ohe = OneHotEncoder(categorical_features=np.array([1, 2]), dtype=dt)        
ohe.fit(d)
PascalVKooten asked Dec 04 '15 13:12



2 Answers

Keep in mind that the estimators in scikit-learn were designed for numerical input only. From that point of view it makes no sense to leave the text column in its current form: you have to either transform it into something numerical or drop it.

If your dataset comes from a pandas DataFrame, take a look at this small wrapper: https://github.com/paulgb/sklearn-pandas. It lets you transform all the columns you need in one step (or leave some of the columns in numerical form).

import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

data = pd.DataFrame({'text':['aaa', 'bbb'], 'number_1':[1, 1], 'number_2':[2, 2]})

#   text  number_1  number_2
# 0  aaa         1         2
# 1  bbb         1         2

# LabelBinarizer stands in for "SomeEncoder" here: any encoder that gives you
# a numerical representation of the text column will do
mapper = DataFrameMapper([
    ('text', LabelBinarizer()),
    (['number_1', 'number_2'], OneHotEncoder())
])
mapper.fit_transform(data)
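
If you would rather not add the sklearn-pandas dependency, the same idea (encode only the numeric columns, decide separately what happens to the text column) can be sketched with ColumnTransformer. This assumes scikit-learn 0.20 or newer, where ColumnTransformer exists, and pandas; the column names and values are illustrative, and dropping the text column via remainder='drop' is just one possible choice.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative data: one text column and two integer columns
data = pd.DataFrame({'text': ['aaa', 'bbb'],
                     'number_1': [1, 2],
                     'number_2': [1, 2]})

# One-hot encode only the two numeric columns; the text column is dropped.
# sparse_threshold=0 asks for a dense array as the result.
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['number_1', 'number_2'])],
    remainder='drop',
    sparse_threshold=0,
)
encoded = ct.fit_transform(data)
print(encoded)
```

Each numeric column has two distinct values here, so the result has four indicator columns. Using remainder='passthrough' instead would keep the text column in the output unchanged.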
Ibraim Ganiev answered Oct 20 '22 17:10


I think there's some confusion here. You still need to pass numerical values, but within the encoder you can specify which features are categorical and which are not.

The documentation states: "The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features."

So in the example below I replace aaa with 5 and bbb with 6, which keeps them distinct from the numerical values 1 and 2:

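import numpy as np
from sklearn.preprocessing import OneHotEncoder
# categorical_features is only available in older scikit-learn (removed in 0.22)
d = np.array([[5, 1, 1], [6, 2, 2]])
ohe = OneHotEncoder(categorical_features=np.array([True,False,False], dtype=bool))
ohe.fit(d)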

Now you can check your feature categories:

ohe.active_features_
Out[22]: array([5, 6], dtype=int64)
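
Replacing aaa and bbb by hand does not scale beyond a toy example. As a rough sketch, the mapping from strings to integers can be derived with NumPy alone; the variable names below are made up for illustration, and the codes start at 0 instead of the 5 and 6 chosen above.

```python
import numpy as np

# Text column and the two integer columns from the question, kept separate
text = np.array(['aaa', 'bbb'])
numbers = np.array([[1, 1], [2, 2]])

# np.unique returns the sorted distinct strings and, with return_inverse=True,
# the index of each original entry within that sorted list
categories, codes = np.unique(text, return_inverse=True)

# Put the integer codes in front of the numeric columns to get an all-integer matrix
d_int = np.column_stack([codes, numbers])
print(categories, codes)
print(d_int)
```

The categories array doubles as the lookup table for mapping codes back to the original strings.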
Leb answered Oct 20 '22 19:10