Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to do one-hot-encoding based on a combination of two categorical columns using pandas?

I'm trying to do some one-hot encoding with pandas. The input data looks like this:

import pandas as pd

# input data
df = pd.DataFrame({
    "pid"  : [1, 1, 2, 3],
    "code" : ["a", "b", "b", "c"],
    "time" : [0, 0, 1, 0]
})

# two of the columns are categories
df["code"] = df.code.astype("category")
df["time"] = df.time.astype("category")

And I want to one-hot-encode the combination of the columns code and time in a manner that produces 6 columns corresponding to all combinations of the categories in each column. In this small example that would be a_0, a_1, b_0, b_1, c_0, and c_1.

# i've tried doing this:
df["code_time"] = df.code.astype("str") + df.time.astype("str")
del df["code"]
del df["time"]
pd.get_dummies(df, columns=["code_time"])

But that wont give me the a_1 and c_1 combinations since they do not appear in the data. Is there any straightforward method to produce my desired output:

df_out = pd.DataFrame({
    "pid"  : [1, 2, 3]
    "a_0"  : [1, 0, 0]
    "a_1"  : [0, 0, 0]
    "b_0"  : [1, 0, 0]
    "b_1"  : [0, 1, 0]
    "c_0"  : [0, 0, 1]
    "c_1"  : [0, 0, 0]
})
like image 988
Peter H. Avatar asked Mar 01 '23 17:03

Peter H.


2 Answers

A straightforward solution is to create the full product:

from itertools import product
code_times = [f'{x}_{y}' for x,y in product(df['code'].unique(), df['time'].unique())]

data = df.code.astype("str") + '_'+ df.time.astype("str")

# use crosstab for get_dummies() + sum by id
pd.crosstab(df['pid'], data).reindex(code_times, fill_value=0, axis=1)

Output:

col_0  a_0  a_1  b_0  b_1  c_0  c_1
pid                                
1        1    0    1    0    0    0
2        0    0    0    1    0    0
3        0    0    0    0    1    0
like image 173
Quang Hoang Avatar answered Apr 28 '23 06:04

Quang Hoang


You can try pd.MultiIndex.from_product with reindex after series.str.get_dummies() followed by groupby on pid and max.

idx = pd.MultiIndex.from_product((df['code'].unique()
          ,df['time'].astype(str).unique())).map('_'.join)

out = (df['code'].str.cat(df['time'].astype(str),'_').str.get_dummies()
       .reindex(idx,axis=1,fill_value=0).groupby(df['pid']).max().reset_index())

print(out)

   pid  a_0  a_1  b_0  b_1  c_0  c_1
0    1    1    0    1    0    0    0
1    2    0    0    0    1    0    0
2    3    0    0    0    0    1    0
like image 23
anky Avatar answered Apr 28 '23 07:04

anky