I have the following numpy matrix:
M = [
['a', 5, 0.2, ''],
['a', 2, 1.3, 'as'],
['b', 1, 2.3, 'as'],
]
M = np.array(M)
I would like to encode the categorical values ('a', 'b', '', 'as'). I tried to encode them using OneHotEncoder, but the problem is that it does not work with string variables and raises an error:
enc = preprocessing.OneHotEncoder()
enc.fit(M)
enc.transform(M).toarray()
I know that I have to use categorical_features
to indicate which columns I want to encode, and I thought that by providing dtype
I would be able to handle string values, but I cannot. So is there a way to encode the categorical values in my matrix?
From the documentation: OneHotEncoder encodes categorical integer features using a one-hot (aka one-of-K) scheme. The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features. The output will be a sparse matrix where each column corresponds to one possible value of one feature.
This is needed because not all machine learning algorithms can deal with categorical data. Many of them cannot operate on label data directly; they require all input and output variables to be numeric. That is why we need to encode them.
What challenges may you face if you apply OHE to a categorical variable of the train dataset? A) Not all categories of the categorical variable are present in the test dataset. B) The frequency distribution of categories differs between the train and the test dataset.
You can use DictVectorizer
:
from sklearn.feature_extraction import DictVectorizer
import pandas as pd

dv = DictVectorizer(sparse=False)
df = pd.DataFrame(M)
# Coerce the numeric columns back to numbers (convert_objects(convert_numeric=True)
# has been removed from modern pandas; columns 1 and 2 hold the numeric values here).
df[[1, 2]] = df[[1, 2]].apply(pd.to_numeric)
dv.fit_transform(df.to_dict(orient='records'))
array([[ 5. , 0.2, 1. , 0. , 1. , 0. ],
[ 2. , 1.3, 1. , 0. , 0. , 1. ],
[ 1. , 2.3, 0. , 1. , 0. , 1. ]])
dv.feature_names_
holds the mapping to the columns:
[1, 2, '0=a', '0=b', '3=', '3=as']
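If you are already going through pandas, pd.get_dummies is an even shorter route: it one-hot encodes only the string (object) columns and leaves the numeric ones untouched. A sketch assuming the same M as above:

```python
import pandas as pd

M = [['a', 5, 0.2, ''],
     ['a', 2, 1.3, 'as'],
     ['b', 1, 2.3, 'as']]

df = pd.DataFrame(M)  # columns are labelled 0..3
# get_dummies encodes the object columns (0 and 3) and passes
# the numeric columns (1 and 2) through unchanged.
encoded = pd.get_dummies(df)
print(encoded.columns.tolist())
# [1, 2, '0_a', '0_b', '3_', '3_as']
```

Note that get_dummies decides what to encode by dtype, so this only works if the numeric columns are actual numbers rather than strings (i.e., before the np.array conversion, which casts everything to strings).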