Given a list of variable-length feature lists:
features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2']
]
where each sample has a variable number of features, the feature dtype is str, and the features are already one-hot (a feature is either present in a sample or absent).
In order to use feature selection utilities of sklearn, I have to convert the features
to a 2D-array which looks like:
    f1  f2  f3  f4  f5  f6
s1   1   1   1   0   0   0
s2   0   1   0   1   1   1
s3   1   1   0   0   0   0
How could I achieve it via sklearn or numpy?
One-hot encoding is a feature-encoding strategy that converts a categorical feature into a numerical vector of dummy variables. It is used when the categories are nominal, that is, when they have no inherent order. For each value the feature can take, the transformation creates a new binary variable marking the presence or absence of that value.
Another approach is to encode categorical values with a technique called "label encoding", which converts each value in a column to a number. The numerical labels are always between 0 and n_categories-1. In pandas, you can do label encoding via the .cat.codes attribute of a categorical column.
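As a minimal sketch of label encoding with pandas, assuming a small made-up column of nominal values (the colors series below is hypothetical and not part of the question):

```python
import pandas as pd

# Hypothetical column of nominal values, used only for illustration.
colors = pd.Series(["red", "green", "blue", "green"])

# Convert to a categorical dtype; inferred categories are sorted.
as_category = colors.astype("category")

# Integer code for each value, in the range 0 .. n_categories - 1.
codes = as_category.cat.codes

print(list(as_category.cat.categories))  # ['blue', 'green', 'red']
print(codes.tolist())                    # [2, 1, 0, 1]
```

Note that the codes imply an ordering (blue < green < red) that the data does not have, which is why one-hot encoding is usually preferred for nominal features.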
You can use MultiLabelBinarizer from scikit-learn, which handles exactly this kind of input.
Code for your example:
from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2']
]

# Learn the set of distinct features and build the 0/1 indicator matrix
mlb = MultiLabelBinarizer()
new_features = mlb.fit_transform(features)
Output:
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 0, 1, 1, 1],
       [1, 1, 0, 0, 0, 0]])
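The array itself carries no column names. A short sketch of recovering them from the fitted binarizer's classes_ attribute and wrapping the result in a DataFrame shaped like the table in the question (the s1, s2, ... row labels are an assumption taken from that table):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

mlb = MultiLabelBinarizer()
new_features = mlb.fit_transform(features)

# classes_ holds the column labels in the order used by the output matrix.
columns = list(mlb.classes_)

# Row labels s1, s2, ... are assumed from the layout shown in the question.
index = [f"s{i + 1}" for i in range(len(features))]

df = pd.DataFrame(new_features, columns=columns, index=index)
print(df)
```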
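The resulting array can then be passed to the feature_selection utilities. Note that MultiLabelBinarizer is designed for targets and its fit_transform accepts a single argument, so using it as a step inside a Pipeline typically requires a small wrapper.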
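As a minimal sketch of feeding the binarized matrix to a feature-selection utility, the example below uses VarianceThreshold, chosen here only because it needs no target labels (the question does not provide any). It drops f2, which is present in every sample and therefore carries no information:

```python
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(features)

# Drop columns that take the same value in every sample (zero variance).
selector = VarianceThreshold(threshold=0.0)
X_selected = selector.fit_transform(X)

# Map the boolean support mask back to the feature names.
kept = [name for name, keep in zip(mlb.classes_, selector.get_support()) if keep]
print(kept)  # ['f1', 'f3', 'f4', 'f5', 'f6']
```

Selectors that score features against a target, such as SelectKBest, would be used the same way once a label vector is available.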
Here's one approach that uses NumPy methods and returns the result as a pandas DataFrame:
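import numpy as np
import pandas as pd

# Number of features per sample, and number of samples
lens = list(map(len, features))
N = len(lens)
# Unique feature names and, for every entry, its column index
unq, col = np.unique(np.concatenate(features), return_inverse=True)
# Row index of every entry: sample i is repeated lens[i] times
row = np.repeat(np.arange(N), lens)
# Start from zeros and set a 1 wherever a sample has a feature
out = np.zeros((N, len(unq)), dtype=int)
out[row, col] = 1
# Label rows s1..sN and columns with the feature names
indx = ['s' + str(i + 1) for i in range(N)]
df_out = pd.DataFrame(out, columns=unq, index=indx)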
Sample input and output:
In [80]: features
Out[80]: [['f1', 'f2', 'f3'], ['f2', 'f4', 'f5', 'f6'], ['f1', 'f2']]
In [81]: df_out
Out[81]:
    f1  f2  f3  f4  f5  f6
s1   1   1   1   0   0   0
s2   0   1   0   1   1   1
s3   1   1   0   0   0   0
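To make the indexing trick in the NumPy approach easier to follow, here is a small sketch that prints the intermediate row and column index arrays for the same input and, as a sanity check, compares the result with MultiLabelBinarizer:

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

lens = [len(f) for f in features]

# col[k] is the column of the k-th entry in the flattened feature list.
unq, col = np.unique(np.concatenate(features), return_inverse=True)
# row[k] is the sample that the k-th entry belongs to.
row = np.repeat(np.arange(len(features)), lens)

print(col.tolist())  # [0, 1, 2, 1, 3, 4, 5, 0, 1]
print(row.tolist())  # [0, 0, 0, 1, 1, 1, 1, 2, 2]

# Paired (row, col) indices set all the ones in a single assignment.
out = np.zeros((len(features), len(unq)), dtype=int)
out[row, col] = 1

# Both approaches order the columns by sorted feature name.
expected = MultiLabelBinarizer().fit_transform(features)
print(np.array_equal(out, expected))  # True
```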