I'm trying to perform a one hot encoding of a trivial dataset.
data = [['a', 'dog', 'red'], ['b', 'cat', 'green']]
What's the best way to preprocess this data using Scikit-Learn?
On first instinct, you'd look towards Scikit-Learn's OneHotEncoder. But in older versions (before 0.20) the one-hot encoder doesn't support strings as features; it only discretizes integers.
So then you would use a LabelEncoder, which would encode the strings into integers. But then you have to apply a label encoder to each of the columns and store each one of these label encoders (as well as the columns they were applied on). And this feels extremely clunky.
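To make the clunkiness concrete, here is a minimal sketch of that per-column LabelEncoder workflow (the variable names are illustrative, not from any library):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

data = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])

# One LabelEncoder per column, every one of which must be kept around
# to transform the test set later -- this is the clunky part.
encoders = {}
encoded = np.empty(data.shape, dtype=int)
for col in range(data.shape[1]):
    le = LabelEncoder()
    encoded[:, col] = le.fit_transform(data[:, col])
    encoders[col] = le

print(encoded)
# [[0 1 1]
#  [1 0 0]]
```

Each column's integer codes follow the sorted order of that column's classes, so the mapping is only meaningful together with its stored encoder.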
So, what's the best way to do it in Scikit-Learn?
Please don't suggest pandas.get_dummies. That's what I generally use nowadays for one-hot encodings. However, it's limited by the fact that you can't encode your training and test sets separately: the dummy columns depend on which categories happen to appear in each set.
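A short demonstration of that get_dummies limitation (a toy frame, assuming a category that appears in train but not in test):

```python
import pandas as pd

train = pd.DataFrame({'animal': ['dog', 'cat']})
test = pd.DataFrame({'animal': ['dog']})  # 'cat' never appears here

# Encoding the two frames separately yields different column sets,
# so a model fit on train_enc cannot consume test_enc directly.
train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

print(list(train_enc.columns))  # ['animal_cat', 'animal_dog']
print(list(test_enc.columns))   # ['animal_dog']
```

A fitted encoder object (like OneHotEncoder) avoids this by remembering the training categories and reusing them at transform time.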
One-hot encoding is the process of creating dummy variables. It is used for nominal categorical features, i.e. features whose categories have no inherent order. In one-hot encoding, a new binary variable is created for every category of the feature.
In the binary encoding scheme, the categorical feature is first converted to integers with an ordinal encoder. Each integer is then written in binary, and each bit is split out into its own column. Binary encoding works well when a feature has a high number of categories, since it needs only about log2(n) columns instead of n.
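The scheme above can be sketched in a few lines of plain Python; this is a toy illustration of the idea, not the category_encoders API (the function name and bit ordering are my own choices):

```python
import numpy as np

def binary_encode(values):
    """Toy binary encoding: ordinal-encode the values, then split the
    bits of each code into separate columns (most significant bit first)."""
    categories = sorted(set(values))
    ordinals = [categories.index(v) for v in values]
    n_bits = max(1, max(ordinals).bit_length())
    return np.array([[(o >> b) & 1 for b in range(n_bits - 1, -1, -1)]
                     for o in ordinals])

print(binary_encode(['a', 'b', 'c', 'a']))
# [[0 0]
#  [0 1]
#  [1 0]
#  [0 0]]
```

Three categories fit into two bit-columns here, versus three columns for one-hot.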
What challenges might you face if you have applied OHE to a categorical variable of the train dataset? A) Some categories of the categorical variable are not present in the test dataset. B) The frequency distribution of the categories differs between the train and test datasets.
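Challenge A is exactly what OneHotEncoder's handle_unknown="ignore" option addresses: categories unseen at fit time are encoded as an all-zero row instead of raising an error. A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([['dog'], ['cat']])
test = np.array([['dog'], ['fish']])  # 'fish' was never seen during fit

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

# 'fish' maps to all zeros rather than raising a ValueError.
print(enc.transform(test).toarray())
# [[0. 1.]
#  [0. 0.]]
```

Challenge B (shifted category frequencies) is not an encoding error per se, but it can degrade the model, and no encoder option fixes it.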
If you are on sklearn >= 0.20 (or a recent dev build), OneHotEncoder handles strings directly:
In [11]: import numpy as np
    ...: from sklearn.preprocessing import OneHotEncoder
    ...: cat = OneHotEncoder()
    ...: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T
    ...: cat.fit_transform(X).toarray()
Out[11]:
array([[1., 0., 0., 1., 0.],
       [0., 1., 0., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 0., 1.]])
If you are on the sklearn==0.20.dev0 development snapshot specifically, the same behaviour was briefly exposed as CategoricalEncoder (later folded back into OneHotEncoder):
In [29]: import numpy as np
    ...: from sklearn.preprocessing import CategoricalEncoder

In [30]: cat = CategoricalEncoder()

In [31]: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T

In [32]: cat.fit_transform(X).toarray()
Out[32]:
array([[ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  1.]])
Another way to do it is to use category_encoders.
Here is an example:
% pip install category_encoders

import numpy as np
import category_encoders as ce

le = ce.OneHotEncoder(return_df=False, impute_missing=False, handle_unknown="ignore")
X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])
le.fit_transform(X)
# array([[1, 0, 1, 0, 1, 0],
#        [0, 1, 0, 1, 0, 1]])