I'm trying to perform a one hot encoding of a trivial dataset.
data = [['a', 'dog', 'red'], ['b', 'cat', 'green']]
What's the best way to preprocess this data using Scikit-Learn?
On first instinct, you'd look towards Scikit-Learn's OneHotEncoder. But in older versions (before 0.20) the one-hot encoder doesn't support strings as features; it only discretizes integers.
So then you would use a LabelEncoder, which would encode the strings into integers. But then you have to apply a label encoder to each of the columns and store each one of these label encoders (as well as the columns they were applied on). And this feels extremely clunky.
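To make the clunkiness concrete, here is a minimal sketch of that per-column LabelEncoder workflow (the variable names are illustrative, not from any library):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

data = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])

# One LabelEncoder per column, every one of which must be kept around
# to transform the test set later -- this is the clunky part.
encoders = {}
encoded = np.empty(data.shape, dtype=int)
for col in range(data.shape[1]):
    le = LabelEncoder()
    encoded[:, col] = le.fit_transform(data[:, col])
    encoders[col] = le

print(encoded)
# [[0 1 1]
#  [1 0 0]]
```

Each column's integer codes follow the sorted order of that column's classes, so the mapping is only meaningful together with its stored encoder.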
So, what's the best way to do it in Scikit-Learn?
Please don't suggest pandas.get_dummies. That's what I generally use nowadays for one-hot encodings. However, it's limited by the fact that you can't encode your training and test sets separately: the dummy columns depend on which categories happen to appear in each set.
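A short demonstration of that get_dummies limitation (a toy frame, assuming a category that appears in train but not in test):

```python
import pandas as pd

train = pd.DataFrame({'animal': ['dog', 'cat']})
test = pd.DataFrame({'animal': ['dog']})  # 'cat' never appears here

# Encoding the two frames separately yields different column sets,
# so a model fit on train_enc cannot consume test_enc directly.
train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

print(list(train_enc.columns))  # ['animal_cat', 'animal_dog']
print(list(test_enc.columns))   # ['animal_dog']
```

A fitted encoder object (like OneHotEncoder) avoids this by remembering the training categories and reusing them at transform time.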
One-hot encoding is the process of creating dummy variables. It is used for nominal categorical features, i.e. features whose categories have no inherent order. In one-hot encoding, a new binary variable is created for every category of the feature.
In the binary encoding scheme, the categorical feature is first converted to integers with an ordinal encoder. Each integer is then written in binary, and each bit is split out into its own column. Binary encoding works well when a feature has a high number of categories, since it needs only about log2(n) columns instead of n.
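The scheme above can be sketched in a few lines of plain Python; this is a toy illustration of the idea, not the category_encoders API (the function name and bit ordering are my own choices):

```python
import numpy as np

def binary_encode(values):
    """Toy binary encoding: ordinal-encode the values, then split the
    bits of each code into separate columns (most significant bit first)."""
    categories = sorted(set(values))
    ordinals = [categories.index(v) for v in values]
    n_bits = max(1, max(ordinals).bit_length())
    return np.array([[(o >> b) & 1 for b in range(n_bits - 1, -1, -1)]
                     for o in ordinals])

print(binary_encode(['a', 'b', 'c', 'a']))
# [[0 0]
#  [0 1]
#  [1 0]
#  [0 0]]
```

Three categories fit into two bit-columns here, versus three columns for one-hot.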
What challenges might you face if you have applied OHE to a categorical variable of the train dataset? A) Some categories of the categorical variable are not present in the test dataset. B) The frequency distribution of the categories differs between the train and test datasets.
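Challenge A is exactly what OneHotEncoder's handle_unknown="ignore" option addresses: categories unseen at fit time are encoded as an all-zero row instead of raising an error. A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([['dog'], ['cat']])
test = np.array([['dog'], ['fish']])  # 'fish' was never seen during fit

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

# 'fish' maps to all zeros rather than raising a ValueError.
print(enc.transform(test).toarray())
# [[0. 1.]
#  [0. 0.]]
```

Challenge B (shifted category frequencies) is not an encoding error per se, but it can degrade the model, and no encoder option fixes it.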
If you are on sklearn >= 0.20 (or a recent dev build), OneHotEncoder handles strings directly:
In [11]: import numpy as np
    ...: from sklearn.preprocessing import OneHotEncoder
    ...: cat = OneHotEncoder()
    ...: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T
    ...: cat.fit_transform(X).toarray()
Out[11]:
array([[1., 0., 0., 1., 0.],
       [0., 1., 0., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 0., 1.]])
If you are on the sklearn==0.20.dev0 development snapshot specifically, the same behaviour was briefly exposed as CategoricalEncoder (later folded back into OneHotEncoder):
In [29]: import numpy as np
    ...: from sklearn.preprocessing import CategoricalEncoder

In [30]: cat = CategoricalEncoder()

In [31]: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T

In [32]: cat.fit_transform(X).toarray()
Out[32]:
array([[ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  1.]])
Another way to do it is to use category_encoders.
Here is an example:
% pip install category_encoders

import numpy as np
import category_encoders as ce

le = ce.OneHotEncoder(return_df=False, impute_missing=False, handle_unknown="ignore")
X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])
le.fit_transform(X)
# array([[1, 0, 1, 0, 1, 0],
#        [0, 1, 0, 1, 0, 1]])