I have the following numpy matrix:
M = [
['a', 5, 0.2, ''],
['a', 2, 1.3, 'as'],
['b', 1, 2.3, 'as'],
]
M = np.array(M)
I would like to encode the categorical values ('a', 'b', '', 'as'). I tried to encode them using OneHotEncoder, but the problem is that it does not work with string variables and raises an error:
enc = preprocessing.OneHotEncoder()
enc.fit(M)
enc.transform(M).toarray()
I know that I have to use categorical_features
to indicate which columns I want to encode, and I thought that by providing dtype
I would be able to handle string values, but I cannot. So is there a way to encode the categorical values in my matrix?
From the documentation: OneHotEncoder encodes categorical integer features using a one-hot (aka one-of-K) scheme. The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features. The output will be a sparse matrix where each column corresponds to one possible value of one feature.
This is needed because not all machine learning algorithms can deal with categorical data. Many of them cannot operate on label data directly; they require all input and output variables to be numeric. That is why we need to encode them.
What challenges may you face if you apply OHE to a categorical variable of the train dataset? A) Not all categories of the categorical variable are present in the test dataset. B) The frequency distribution of categories differs between the train and the test dataset.
You can use DictVectorizer
:
from sklearn.feature_extraction import DictVectorizer
import pandas as pd

dv = DictVectorizer(sparse=False)
df = pd.DataFrame(M)
# Coerce the numeric columns back to numbers (convert_objects(convert_numeric=True)
# has been removed from modern pandas; columns 1 and 2 hold the numeric values here).
df[[1, 2]] = df[[1, 2]].apply(pd.to_numeric)
dv.fit_transform(df.to_dict(orient='records'))
array([[ 5. , 0.2, 1. , 0. , 1. , 0. ],
[ 2. , 1.3, 1. , 0. , 0. , 1. ],
[ 1. , 2.3, 0. , 1. , 0. , 1. ]])
dv.feature_names_
holds the mapping to the columns:
[1, 2, '0=a', '0=b', '3=', '3=as']
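If you are already going through pandas, pd.get_dummies is an even shorter route: it one-hot encodes only the string (object) columns and leaves the numeric ones untouched. A sketch assuming the same M as above:

```python
import pandas as pd

M = [['a', 5, 0.2, ''],
     ['a', 2, 1.3, 'as'],
     ['b', 1, 2.3, 'as']]

df = pd.DataFrame(M)  # columns are labelled 0..3
# get_dummies encodes the object columns (0 and 3) and passes
# the numeric columns (1 and 2) through unchanged.
encoded = pd.get_dummies(df)
print(encoded.columns.tolist())
# [1, 2, '0_a', '0_b', '3_', '3_as']
```

Note that get_dummies decides what to encode by dtype, so this only works if the numeric columns are actual numbers rather than strings (i.e., before the np.array conversion, which casts everything to strings).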