 

OneHotEncoder with string categorical values

I have the following numpy matrix:

import numpy as np

M = [
    ['a', 5, 0.2, ''],
    ['a', 2, 1.3, 'as'],
    ['b', 1, 2.3, 'as'],
]
M = np.array(M)

I would like to encode the categorical values ('a', 'b', '', 'as'). I tried to encode them using OneHotEncoder. The problem is that it does not work with string values and raises an error.

from sklearn import preprocessing

enc = preprocessing.OneHotEncoder()
enc.fit(M)
enc.transform(M).toarray()

I know that I have to use categorical_features to indicate which columns I am going to encode, and I thought that providing dtype would let me handle string values, but it does not. Is there a way to encode the categorical values in my matrix?

Salvador Dali asked Oct 08 '15 06:10




1 Answer

You can use DictVectorizer:

from sklearn.feature_extraction import DictVectorizer
import pandas as pd

dv = DictVectorizer(sparse=False)
# convert_objects() was removed in pandas 1.0; cast the numeric columns explicitly
df = pd.DataFrame(M).astype({1: int, 2: float})
# string column names keep the feature-name sort from mixing int and str on Python 3
df.columns = df.columns.astype(str)
dv.fit_transform(df.to_dict(orient='records'))

array([[1. , 0. , 5. , 0.2, 1. , 0. ],
       [1. , 0. , 2. , 1.3, 0. , 1. ],
       [0. , 1. , 1. , 2.3, 0. , 1. ]])

dv.feature_names_ gives the feature name for each output column:

['0=a', '0=b', '1', '2', '3=', '3=as']
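
To make the naming scheme concrete, here is a plain-Python sketch of the idea behind those feature names. It is not scikit-learn's actual implementation, and the helpers `feature_name` and `encode` are made up for illustration. Each string value gets its own "column=value" indicator column, while numeric values keep a single column. The column keys are written as strings here so they sort cleanly, so the column order may differ from the listing above.

```python
# The three rows of M as records, keyed by column position (as strings).
rows = [
    {"0": "a", "1": 5, "2": 0.2, "3": ""},
    {"0": "a", "1": 2, "2": 1.3, "3": "as"},
    {"0": "b", "1": 1, "2": 2.3, "3": "as"},
]


def feature_name(key, value):
    """String values map to a 'key=value' indicator column; numbers keep the key."""
    return f"{key}={value}" if isinstance(value, str) else key


# Collect every distinct output column name, sorted for a stable order.
names = sorted({feature_name(k, v) for row in rows for k, v in row.items()})
index = {name: i for i, name in enumerate(names)}


def encode(row):
    """Turn one record into a dense list of floats aligned with `names`."""
    out = [0.0] * len(names)
    for key, value in row.items():
        # Indicator columns hold 1.0; numeric columns hold the value itself.
        out[index[feature_name(key, value)]] = 1.0 if isinstance(value, str) else float(value)
    return out


encoded = [encode(row) for row in rows]

print(names)
for line in encoded:
    print(line)
```

Reading `names` next to any row of `encoded` shows which original column and category each position stands for, which is the same role dv.feature_names_ plays for the DictVectorizer output.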

hellpanderr answered Oct 20 '22 00:10