Let's assume that I have a pandas dataframe with the following column names:
'age' (e.g. 33, 26, 51 etc)
'seniority' (e.g. 'junior', 'senior' etc)
'gender' (e.g. 'male', 'female')
'salary' (e.g. 32000, 40000, 64000 etc)
I want to transform the seniority categorical variable to one-hot encoded values. For this reason I am doing the following:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
data['seniority'] = label_encoder.fit_transform(data['seniority'])
from sklearn.preprocessing import OneHotEncoder
one_hot_encoder = OneHotEncoder(categorical_features=[1])
data = one_hot_encoder.fit_transform(data.values)
But then I am getting this error:
ValueError: could not convert string to float: 'gender'
at the line
data = one_hot_encoder.fit_transform(data.values)
However, I have explicitly specified categorical_features=[1], so only column 1 (seniority) should be considered for this one-hot encoding.
How can I fix this error (except for example by dropping the column 'gender')?
I was using pandas.get_dummies in the past and I did not have this problem.
In one-hot encoding, the order of the categories does not matter. Categorical data is converted into numeric data by splitting the column into multiple columns, one per category. Each value is replaced by a 1 in the column that matches its category and a 0 in all the others.
Here we apply one-hot encoding to only 3-4 categorical columns of the dataset.
Limitation of label encoding: label encoding converts the data into machine-readable form, but it assigns a unique number (starting from 0) to each class. A model may then read those numbers as an ordering or priority among the classes, which can cause problems during training.
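To make the difference concrete, here is a minimal sketch that compares the two encodings on a small series. It uses pandas category codes as a stand-in for label encoding, and the 'lead' level is a made-up value added only for illustration.

```python
import pandas as pd

# Hypothetical sample; 'lead' is an invented level for illustration only
seniority = pd.Series(['junior', 'senior', 'lead', 'junior'], name='seniority')

# Label encoding: one integer per category, assigned in sorted order
# (junior=0, lead=1, senior=2). The integers suggest lead < senior,
# an ordering that a model may wrongly treat as meaningful.
labels = seniority.astype('category').cat.codes

# One-hot encoding: one 0/1 column per category, no implied order
one_hot = pd.get_dummies(seniority).astype(int)

print(labels.tolist())
print(one_hot)
```

Each row of the one-hot frame contains exactly one 1, so no category looks "larger" than another.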
I think for this case you should stick to pd.get_dummies:
>>> data
   age seniority  gender  salary
0    1    junior    male       5
1    2    senior  female       6
2    3    junior  female       7
# One hot encode with get_dummies
data = pd.concat((data, pd.get_dummies(data.seniority)), axis=1)
>>> data
   age seniority  gender  salary  junior  senior
0    1    junior    male       5       1       0
1    2    senior  female       6       0       1
2    3    junior  female       7       1       0
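As a variation on the snippet above, get_dummies can also take the whole frame plus a columns argument, which encodes the listed columns and drops the originals in one call. This is a sketch on the same toy data; the cast to int is only there because the dummy dtype differs between pandas versions.

```python
import pandas as pd

# Same toy data as in the example above
data = pd.DataFrame({
    'age': [1, 2, 3],
    'seniority': ['junior', 'senior', 'junior'],
    'gender': ['male', 'female', 'female'],
    'salary': [5, 6, 7],
})

# Encode only 'seniority'; 'gender' stays as a string column and the
# original 'seniority' column is replaced by prefixed dummy columns
encoded = pd.get_dummies(data, columns=['seniority'])

# Dummy columns may be bool or uint8 depending on the pandas version,
# so cast them to int for a consistent result
dummy_cols = ['seniority_junior', 'seniority_senior']
encoded[dummy_cols] = encoded[dummy_cols].astype(int)

print(encoded)
```

The untouched columns keep their original order and the dummy columns are appended at the end.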
The problem is that sklearn's OneHotEncoder needs an array of ints as input, but the array data.values still contains the string representation of gender. You could, if you wanted, one-hot encode just the seniority values, but if you want to know the meaning of the resulting features it's not very nice: you have to pass the column names manually (which is infeasible in a lot of cases):
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
data['seniority'] = label_encoder.fit_transform(data['seniority'])
from sklearn.preprocessing import OneHotEncoder
one_hot_encoder = OneHotEncoder(sparse=False)
data[['junior','senior']] = one_hot_encoder.fit_transform(data['seniority'].values.reshape(-1,1))
>>> data
   age  seniority  gender  salary  junior  senior
0    1          0    male       5     1.0     0.0
1    2          1  female       6     0.0     1.0
2    3          0  female       7     1.0     0.0
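One way to avoid typing the column names by hand is to read them back from the fitted LabelEncoder, whose classes_ attribute lists the original labels in code order. The sketch below assumes the same toy data and leaves the encoder's output in its default sparse form, converting it with toarray(), so it does not depend on the name of the sparse keyword.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Same toy data as in the example above
data = pd.DataFrame({
    'age': [1, 2, 3],
    'seniority': ['junior', 'senior', 'junior'],
    'gender': ['male', 'female', 'female'],
    'salary': [5, 6, 7],
})

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(data['seniority'])

# classes_[i] is the original label for integer code i
feature_names = list(label_encoder.classes_)

one_hot_encoder = OneHotEncoder()
# The default output is a SciPy sparse matrix; toarray() makes it dense
encoded = one_hot_encoder.fit_transform(codes.reshape(-1, 1)).toarray()

# Name the new columns after the original labels
one_hot = pd.DataFrame(encoded, columns=feature_names, index=data.index)
result = pd.concat([data, one_hot], axis=1)

print(result)
```

Because the integer codes and the one-hot columns are both in sorted order, the names from classes_ line up with the encoded columns.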
Or, if the feature names don't matter:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
data['seniority'] = label_encoder.fit_transform(data['seniority'])
from sklearn.preprocessing import OneHotEncoder
one_hot_encoder = OneHotEncoder(sparse=False)
data = pd.concat((data, pd.DataFrame(one_hot_encoder.fit_transform(data['seniority'].values.reshape(-1,1)))), axis=1)
>>> data
   age  seniority  gender  salary    0    1
0    1          0    male       5  1.0  0.0
1    2          1  female       6  0.0  1.0
2    3          0  female       7  1.0  0.0
But in the end, pd.get_dummies does the job in a much nicer way (IMO).
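If you do want to stay within sklearn, a possible alternative (not part of the approach above) is ColumnTransformer, which applies an encoder to selected columns and passes the rest through unchanged. This sketch assumes scikit-learn 0.20 or later, where ColumnTransformer exists and OneHotEncoder accepts string categories directly; the transformer name 'seniority' is an arbitrary label.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Same toy data as in the example above
data = pd.DataFrame({
    'age': [1, 2, 3],
    'seniority': ['junior', 'senior', 'junior'],
    'gender': ['male', 'female', 'female'],
    'salary': [5, 6, 7],
})

transformer = ColumnTransformer(
    # Only 'seniority' is one-hot encoded
    transformers=[('seniority', OneHotEncoder(), ['seniority'])],
    # Every other column, including the string column 'gender', is kept as-is
    remainder='passthrough',
    # Force a dense NumPy array as output
    sparse_threshold=0,
)

# Encoded columns come first, followed by the passed-through columns
result = transformer.fit_transform(data)

print(result)
```

The result is a plain array rather than a DataFrame, so the column names are lost, which is the same drawback mentioned above and another reason pd.get_dummies is often more convenient.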