sklearn mask for OneHotEncoder does not work

Tags: python, numpy, one-hot-encoding, scikit-learn, transformation

Considering data like:

from sklearn.preprocessing import OneHotEncoder
import numpy as np
dt = 'object, i4, i4'
d = np.array([('aaa', 1, 1), ('bbb', 2, 2)], dtype=dt)

I want to exclude the text column using the OHE functionality.

Why does the following not work?


ohe = OneHotEncoder(categorical_features=np.array([False,True,True], dtype=bool))       
ohe.fit(d)
ValueError: could not convert string to float: 'bbb'

It says in the documentation:


categorical_features: “all” or array of indices or mask :
  Specify what features are treated as categorical.
   ‘all’ (default): All features are treated as categorical.
   array of indices: Array of categorical feature indices.
   mask: Array of length n_features and with dtype=bool.

I'm using a mask, yet it still tries to convert to float.

Even using


ohe = OneHotEncoder(categorical_features=np.array([False,True,True], dtype=bool), 
                    dtype=dt)        
ohe.fit(d)

This gives the same error.

The same happens when I pass an array of indices instead of a mask:


ohe = OneHotEncoder(categorical_features=np.array([1, 2]), dtype=dt)        
ohe.fit(d)


asked Dec 04 '15 13:12

PascalVKooten

2 Answers

You should understand that the estimators in scikit-learn were designed for numerical input only. From that point of view, it makes no sense to leave the text column in this form: you have to transform it into something numerical, or get rid of it.

If your dataset comes from a pandas DataFrame, take a look at this small wrapper: https://github.com/paulgb/sklearn-pandas. It lets you transform all the columns you need in one step (or leave some of the columns in numerical form).
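To illustrate why the mask does not help, here is a minimal NumPy-only sketch. It assumes the encoder validates the whole input as numeric before it looks at the mask, which matches the error in the question; the variable names are mine, and this is not scikit-learn's actual code.

```python
import numpy as np

# Same data as in the question, stored as a plain 2-D object array.
X = np.array([['aaa', 1, 1], ['bbb', 2, 2]], dtype=object)
mask = np.array([False, True, True])

# Converting the whole array to float fails on the text column,
# no matter which columns are selected afterwards.
try:
    X.astype(float)
    error = None
except ValueError as exc:
    error = str(exc)

# Applying the mask first, and only then converting, works.
numeric_part = X[:, mask].astype(float)
```

Here `error` holds a "could not convert string to float" message, while `numeric_part` contains only the two integer columns.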


import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({'text':['aaa', 'bbb'], 'number_1':[1, 1], 'number_2':[2, 2]})

#    number_1  number_2 text
# 0         1         2  aaa
# 1         1         2  bbb

# SomeEncoder is a placeholder: use any encoder that turns the
# text column into a numerical representation
mapper = DataFrameMapper([
    ('text', SomeEncoder),
    (['number_1', 'number_2'], OneHotEncoder())
])
mapper.fit_transform(data)
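For readers on a newer scikit-learn: `categorical_features` was deprecated in version 0.20 and removed in 0.22, and column selection is now usually done with `ColumnTransformer`. The following is a rough sketch of that approach under those assumptions, not part of the original answer. It drops the text column and one-hot encodes the two integer columns.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Same data as in the question, as a 2-D object array.
X = np.array([['aaa', 1, 1], ['bbb', 2, 2]], dtype=object)

# One-hot encode columns 1 and 2; every other column (the text) is dropped.
ct = ColumnTransformer(
    [('ohe', OneHotEncoder(), [1, 2])],
    remainder='drop',
)
encoded = ct.fit_transform(X)

# Depending on density the result may be sparse; make it dense for display.
if hasattr(encoded, 'toarray'):
    encoded = encoded.toarray()
```

With `remainder='passthrough'` the text column would be kept unchanged next to the encoded columns instead of being dropped.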

answered Oct 20 '22 17:10

Ibraim Ganiev

I think there's some confusion here. You still need to pass numerical values, but within the encoder you can specify which features are categorical and which are not.

The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features.

So in the example below I change aaa to 5 and bbb to 6, so that they can be told apart from the numerical values 1 and 2:


d = np.array([[5, 1, 1], [6, 2, 2]])
ohe = OneHotEncoder(categorical_features=np.array([True,False,False], dtype=bool))
ohe.fit(d)

Now you can check your feature categories:


ohe.active_features_
Out[22]: array([5, 6], dtype=int64)
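Replacing the strings by hand does not scale, so here is one possible way to derive integer codes automatically, using `np.unique` with `return_inverse=True`. This is only a sketch: the sample values and variable names are made up for illustration, and the codes start at 0 rather than at 5.

```python
import numpy as np

# Hypothetical text column and two numeric columns.
text = np.array(['aaa', 'bbb', 'aaa'])
num_1 = np.array([1, 2, 1])
num_2 = np.array([1, 2, 1])

# categories: sorted unique strings; codes: index of each value in categories.
categories, codes = np.unique(text, return_inverse=True)

# All-integer matrix that can be passed to an encoder.
d_numeric = np.column_stack([codes, num_1, num_2])
```

`categories[codes]` recovers the original strings, so the mapping can be reversed later.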

answered Oct 20 '22 19:10

Leb
