How to one-hot encode variable-length features?

Given a list of variable-length features:

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2']
]

where each sample has a variable number of features, and each feature is a str whose presence in a sample already acts as a one-hot indicator.

To use sklearn's feature selection utilities, I have to convert the features to a 2D array that looks like this:

    f1  f2  f3  f4  f5  f6
s1   1   1   1   0   0   0
s2   0   1   0   1   1   1
s3   1   1   0   0   0   0

How can I achieve this with sklearn or numpy?

asked Feb 22 '17 12:02

Zelong

2 Answers

You can use MultiLabelBinarizer from scikit-learn, which is designed for exactly this kind of input.

Code for your example:

from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2']
]

# Learn the set of distinct features and build the binary indicator matrix
mlb = MultiLabelBinarizer()
new_features = mlb.fit_transform(features)

Output:

array([[1, 1, 1, 0, 0, 0],
       [0, 1, 0, 1, 1, 1],
       [1, 1, 0, 0, 0, 0]])
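
The array above carries no row or column labels. As a minimal sketch (not part of the original answer), the fitted binarizer's classes_ attribute gives the column order, and the array can be wrapped in a pandas DataFrame to get the labeled table from the question. The sample labels s1, s2, s3 are taken from the question and are purely illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

mlb = MultiLabelBinarizer()
new_features = mlb.fit_transform(features)

# classes_ holds the distinct features in sorted order, one per output column
columns = list(mlb.classes_)

# Illustrative sample labels (s1, s2, ...) matching the table in the question
index = ['s' + str(i + 1) for i in range(len(features))]

df = pd.DataFrame(new_features, columns=columns, index=index)
print(df)
```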

This can also be used in a pipeline, along with other feature_selection utilities.
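
As a rough sketch of combining the binarized output with feature selection, the snippet below passes the matrix to VarianceThreshold. That selector is an assumption chosen only for illustration, because it needs no target values. It is applied to the array directly rather than through a Pipeline object.

```python
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

# Step 1: binarize the variable-length feature lists
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(features)

# Step 2: drop columns that are constant across all samples
# (here 'f2', which is present in every sample)
selector = VarianceThreshold(threshold=0.0)
X_selected = selector.fit_transform(X)

# Map the surviving columns back to their feature names
kept = list(mlb.classes_[selector.get_support()])
print(kept)
print(X_selected)
```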

answered Oct 07 '22 18:10

Vivek Kumar

Here's one approach using NumPy, with the result returned as a pandas DataFrame:

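import numpy as np
import pandas as pd

# Number of features in each sample, and number of samples
lens = list(map(len, features))
N = len(lens)

# unq: sorted distinct features; col: column index of every flattened feature
unq, col = np.unique(np.concatenate(features), return_inverse=True)

# row: sample index of every flattened feature
row = np.repeat(np.arange(N), lens)

# Set a 1 wherever a sample contains a feature
out = np.zeros((N, len(unq)), dtype=int)
out[row, col] = 1

indx = ['s' + str(i + 1) for i in range(N)]
df_out = pd.DataFrame(out, columns=unq, index=indx)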

Sample input and output:

In [80]: features
Out[80]: [['f1', 'f2', 'f3'], ['f2', 'f4', 'f5', 'f6'], ['f1', 'f2']]

In [81]: df_out
Out[81]: 
    f1  f2  f3  f4  f5  f6
s1   1   1   1   0   0   0
s2   0   1   0   1   1   1
s3   1   1   0   0   0   0
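
As a quick sanity check (a sketch added for illustration, not part of the original answer), the NumPy construction can be compared against MultiLabelBinarizer on the same input. Both sort the distinct features, so the column order and the values should agree for this example.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

# NumPy construction: row/column indices of every (sample, feature) pair
lens = [len(f) for f in features]
n_samples = len(features)
unq, col = np.unique(np.concatenate(features), return_inverse=True)
row = np.repeat(np.arange(n_samples), lens)
out = np.zeros((n_samples, len(unq)), dtype=int)
out[row, col] = 1

# scikit-learn construction on the same input
mlb = MultiLabelBinarizer()
expected = mlb.fit_transform(features)

same_columns = list(unq) == list(mlb.classes_)
same_values = np.array_equal(out, expected)
print(same_columns, same_values)
```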

answered Oct 07 '22 18:10

Divakar

Tags: python, pandas, numpy, scikit-learn