I have the following data frame where there are records with features about different subjects:
ID Feature
-------------------------
1 A
1 B
2 A
1 A
3 B
3 B
1 C
2 C
3 D
I'd like to get another (aggregated?) data frame where each row represents a specific subject, and there is an exhaustive list of all one-hot encoded features:
ID FEATURE_A FEATURE_B FEATURE_C FEATURE_D
--------------------------------------------
1 1 1 1 0
2 1 0 1 0
3 0 1 0 1
How could it be implemented in Python (Pandas)?
Bonus: how could a version be implemented where the feature columns contain occurrence counts, not just binary flags?
With this technique, a categorical column is split into separate columns, one per label, for example Male and Female. Wherever the value is Male, the Male column is 1 and the Female column is 0, and vice versa.
Because this procedure generates several new variables, it can cause problems (too many predictors) if the original column has a large number of unique values. Another disadvantage of one-hot encoding is that it introduces multicollinearity among the generated variables, which can lower a model's accuracy.
In that case, one-hot encoding can be applied to the ordinal representation: the integer-encoded variable is removed and one new binary variable is added for each unique integer value, with each bit representing a possible category.
One-hot encoding is the process of creating dummy variables. It is used for nominal categorical features, i.e. features whose values have no inherent order: for every category of such a feature, a new binary variable is created, as in the sketch below.
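A minimal sketch of this idea with pandas' get_dummies, using a made-up Gender column (the column name and data are illustrative, not from the question):
import pandas as pd

# hypothetical nominal column with Male/Female labels
people = pd.DataFrame({'Gender': ['Male', 'Female', 'Male']})
# one output column per label; dtype=int gives 0/1 instead of booleans
print(pd.get_dummies(people['Gender'], dtype=int))
#    Female  Male
# 0       0     1
# 1       1     0
# 2       0     1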
By using pd.crosstab
pd.crosstab(df.ID,df.Feature).gt(0).astype(int).add_prefix('FEATURE ')
Out[805]:
Feature FEATURE A FEATURE B FEATURE C FEATURE D
ID
1 1 1 1 0
2 1 0 1 0
3 0 1 0 1
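To match the underscore column naming from the question, the same approach could presumably use add_prefix('FEATURE_') instead of add_prefix('FEATURE '):
pd.crosstab(df.ID, df.Feature).gt(0).astype(int).add_prefix('FEATURE_')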
Or using drop_duplicates then get_dummies:
pd.get_dummies(df.drop_duplicates().set_index('ID')).sum(level=0)
Out[808]:
Feature_A Feature_B Feature_C Feature_D
ID
1 1 1 1 0
2 1 0 1 0
3 0 1 0 1
Additional answer: how could a version be implemented where the feature columns contain occurrence counts, not just binary flags?
Option 1
pd.crosstab(df.ID,df.Feature)
Out[809]:
Feature A B C D
ID
1 2 1 1 0
2 1 0 1 0
3 0 2 0 1
Or
Option 2
pd.get_dummies(df.set_index('ID')).sum(level=0)
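Note: the sum(level=0) and max(level=0) calls used in this thread were deprecated in pandas 1.3 and removed in pandas 2.0; on recent versions the equivalent should be an explicit groupby on the index level, for example:
# binary flags (newer-pandas equivalent of the drop_duplicates answer above)
pd.get_dummies(df.drop_duplicates().set_index('ID')).groupby(level=0).sum()
# occurrence counts (newer-pandas equivalent of Option 2)
pd.get_dummies(df.set_index('ID')).groupby(level=0).sum()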
Use join with get_dummies, then groupby and aggregate max:
df = df[['ID']].join(pd.get_dummies(df['Feature']).add_prefix('FEATURE_')).groupby('ID').max()
print (df)
FEATURE_A FEATURE_B FEATURE_C FEATURE_D
ID
1 1 1 1 0
2 1 0 1 0
3 0 1 0 1
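If occurrence counts are wanted instead of binary flags, the same pattern should work with sum as the aggregation instead of max (applied to the original df, before it is overwritten above):
df[['ID']].join(pd.get_dummies(df['Feature']).add_prefix('FEATURE_')).groupby('ID').sum()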
Detail:
print (pd.get_dummies(df['Feature']))
A B C D
0 1 0 0 0
1 0 1 0 0
2 1 0 0 0
3 1 0 0 0
4 0 1 0 0
5 0 1 0 0
6 0 0 1 0
7 0 0 1 0
8 0 0 0 1
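Note that from pandas 2.0 onward get_dummies returns boolean columns by default; passing dtype=int should restore the 0/1 integers shown here:
print(pd.get_dummies(df['Feature'], dtype=int))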
Another solution with MultiLabelBinarizer and the DataFrame constructor:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df1 = pd.DataFrame(mlb.fit_transform(df['Feature']),
                   columns=['FEATURE_' + x for x in mlb.classes_],
                   index=df.ID).max(level=0)
print (df1)
FEATURE_A FEATURE_B FEATURE_C FEATURE_D
ID
1 1 1 1 0
2 1 0 1 0
3 0 1 0 1
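One caveat: MultiLabelBinarizer iterates over each value, so passing the string column directly only works because every feature here is a single character. For multi-character feature names, each value would presumably need to be wrapped in a one-element list first, along the lines of:
# assumption: wrap each value so labels like 'AB' are not split into single characters
mlb.fit_transform([[v] for v in df['Feature']])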
Timings:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 100000
L = list('abcdefghijklmno'.upper())
df = pd.DataFrame({'Feature': np.random.choice(L, N),
                   'ID': np.random.randint(10000, size=N)})
def jez(df):
    mlb = MultiLabelBinarizer()
    return pd.DataFrame(mlb.fit_transform(df['Feature']),
                        columns=['FEATURE_' + x for x in mlb.classes_],
                        index=df.ID).max(level=0)
#jez1
In [464]: %timeit (df[['ID']].join(pd.get_dummies(df['Feature']).add_prefix('FEATURE_')).groupby('ID').max())
10 loops, best of 3: 39.3 ms per loop
In [465]: %timeit (jez(df))
10 loops, best of 3: 138 ms per loop
#Scott Boston1
In [466]: %timeit (df.set_index('ID')['Feature'].str.get_dummies().add_prefix('FEATURE_').max(level=0))
1 loop, best of 3: 1.03 s per loop
#wen1
In [467]: %timeit (pd.crosstab(df.ID,df.Feature).gt(0).astype(int).add_prefix('FEATURE '))
1 loop, best of 3: 383 ms per loop
#wen2
In [468]: %timeit (pd.get_dummies(df.drop_duplicates().set_index('ID')).sum(level=0))
10 loops, best of 3: 47 ms per loop
Caveat: timings depend on the number of unique values in Feature and ID, which will affect timings a lot for some of these solutions.