One-hot encoding multi-level column data

I have the following data frame, where each record associates a subject (ID) with a feature:

ID   Feature
-------------------------
1    A
1    B
2    A
1    A
3    B
3    B
1    C
2    C
3    D

I'd like to get another (aggregated) data frame where each row represents a specific subject and the columns form an exhaustive list of one-hot encoded features:

ID   FEATURE_A FEATURE_B FEATURE_C FEATURE_D
--------------------------------------------
1    1         1         1         0
2    1         0         1         0
3    0         1         0         1

How could this be implemented in Python (pandas)?

Bonus: how could a version be implemented where the feature columns contain occurrence counts, not just binary flags?
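For reference, the example frame can be rebuilt from the table above with:

import pandas as pd

# Rebuild the sample frame shown above
df = pd.DataFrame({'ID':      [1, 1, 2, 1, 3, 3, 1, 2, 3],
                   'Feature': ['A', 'B', 'A', 'A', 'B', 'B', 'C', 'C', 'D']})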

asked Oct 17 '17 by Hendrik


2 Answers

By using pd.crosstab, which counts occurrences of each (ID, Feature) pair; .gt(0).astype(int) then converts those counts into binary flags:

pd.crosstab(df.ID,df.Feature).gt(0).astype(int).add_prefix('FEATURE ')
Out[805]: 
Feature  FEATURE A  FEATURE B  FEATURE C  FEATURE D
ID                                                 
1                1          1          1          0
2                1          0          1          0
3                0          1          0          1

Or using drop_duplicates then get_dummies

pd.get_dummies(df.drop_duplicates().set_index('ID')).sum(level=0)
Out[808]: 
    Feature_A  Feature_B  Feature_C  Feature_D
ID                                            
1           1          1          1          0
2           1          0          1          0
3           0          1          0          1
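Note that .sum(level=0) was deprecated in pandas 1.3 and removed in pandas 2.0; on current versions the equivalent is an explicit groupby on the index level:

# pandas >= 2.0 equivalent (sum(level=0) no longer exists): group on the ID index level
pd.get_dummies(df.drop_duplicates().set_index('ID')).groupby(level=0).sum()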

Additional answer: how could a version be implemented where the feature columns contain occurrence counts, not just binary flags?

Option 1

pd.crosstab(df.ID,df.Feature)
Out[809]: 
Feature  A  B  C  D
ID                 
1        2  1  1  0
2        1  0  1  0
3        0  2  0  1

Or

Option 2

pd.get_dummies(df.set_index('ID')).sum(level=0)
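This yields the same counts as Option 1, just with Feature_-prefixed column names. As with the earlier snippet, the level argument to sum is gone on pandas >= 2.0, so a groupby is needed there:

# pandas >= 2.0 equivalent of Option 2
pd.get_dummies(df.set_index('ID')).groupby(level=0).sum()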
answered Sep 28 '22 by BENY


Use join with get_dummies, then groupby and aggregate max:

df = df[['ID']].join(pd.get_dummies(df['Feature']).add_prefix('FEATURE_')).groupby('ID').max()
print (df)
    FEATURE_A  FEATURE_B  FEATURE_C  FEATURE_D
ID                                            
1           1          1          1          0
2           1          0          1          0
3           0          1          0          1

Detail:

print (pd.get_dummies(df['Feature']))
   A  B  C  D
0  1  0  0  0
1  0  1  0  0
2  1  0  0  0
3  1  0  0  0
4  0  1  0  0
5  0  1  0  0
6  0  0  1  0
7  0  0  1  0
8  0  0  0  1
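As a side note, the same pipeline also answers the bonus question: aggregating with sum() instead of max() turns the dummy rows into occurrence counts (this sketch assumes df is still the original long frame, not the reassigned result above):

# sum() counts the dummy rows per ID -> occurrence counts instead of binary flags
df[['ID']].join(pd.get_dummies(df['Feature']).add_prefix('FEATURE_')).groupby('ID').sum()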

Another solution with MultiLabelBinarizer and DataFrame constructor:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df1 = pd.DataFrame(mlb.fit_transform(df['Feature']),
                   columns=['FEATURE_' + x for x in mlb.classes_], 
                   index=df.ID).max(level=0)
print (df1)
    FEATURE_A  FEATURE_B  FEATURE_C  FEATURE_D
ID                                            
1           1          1          1          0
2           1          0          1          0
3           0          1          0          1
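One caveat worth noting: MultiLabelBinarizer iterates over each input element, so the single-character labels here happen to work, but a multi-character label such as 'AB' would be split into individual letters. Wrapping each value in a list avoids that, and on pandas >= 2.0 max(level=0) becomes a groupby as well:

from sklearn.preprocessing import MultiLabelBinarizer

# Wrap each label in a list so multi-character labels are not split into letters
mlb = MultiLabelBinarizer()
df1 = pd.DataFrame(mlb.fit_transform([[x] for x in df['Feature']]),
                   columns=['FEATURE_' + x for x in mlb.classes_],
                   index=df.ID).groupby(level=0).max()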

Timings:

import numpy as np

np.random.seed(123)
N = 100000
L = list('abcdefghijklmno'.upper())
df = pd.DataFrame({'Feature': np.random.choice(L, N),
                   'ID': np.random.randint(10000, size=N)})

def jez(df):
    mlb = MultiLabelBinarizer()
    return pd.DataFrame(mlb.fit_transform(df['Feature']),
                   columns=['FEATURE_' + x for x in mlb.classes_], 
                   index=df.ID).max(level=0)


#jez1
In [464]: %timeit (df[['ID']].join(pd.get_dummies(df['Feature']).add_prefix('FEATURE_')).groupby('ID').max())
10 loops, best of 3: 39.3 ms per loop

In [465]: %timeit (jez(df))
10 loops, best of 3: 138 ms per loop

#Scott Boston1
In [466]: %timeit (df.set_index('ID')['Feature'].str.get_dummies().add_prefix('FEATURE_').max(level=0))
1 loop, best of 3: 1.03 s per loop

#wen1
In [467]: %timeit (pd.crosstab(df.ID,df.Feature).gt(0).astype(int).add_prefix('FEATURE '))
1 loop, best of 3: 383 ms per loop

#wen2
In [468]: %timeit (pd.get_dummies(df.drop_duplicates().set_index('ID')).sum(level=0))
10 loops, best of 3: 47 ms per loop

Caveat

These timings do not explore how the ratio of unique Feature values to unique IDs affects performance, and that ratio can change the relative timings of some of these solutions considerably.
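To probe that, the benchmark setup can be rerun with a different ID cardinality; the 100 below is an arbitrary choice for illustration, not a value from the original post:

# Fewer unique IDs -> far more duplicate (ID, Feature) pairs in each group
df = pd.DataFrame({'Feature': np.random.choice(L, N),
                   'ID': np.random.randint(100, size=N)})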
answered Sep 28 '22 by jezrael