How to one-hot-encode factor variables with data.table?

Tags: r, data.table

For those unfamiliar, one-hot encoding simply refers to converting a column of categories (i.e. a factor) into multiple columns of binary indicator variables where each new column corresponds to one of the classes of the original column. This example will explain it better:

library(data.table)

dt <- data.table(
  ID=1:5, 
  Color=factor(c("green", "red", "red", "blue", "green"), levels=c("blue", "green", "red", "purple")),
  Shape=factor(c("square", "triangle", "square", "triangle", "cirlce"))
)

dt
   ID Color    Shape
1:  1 green   square
2:  2   red triangle
3:  3   red   square
4:  4  blue triangle
5:  5 green   cirlce

# one hot encode the colors
color.binarized <- dcast(dt[, list(V1=1, ID, Color)], ID ~ Color, fun.aggregate=sum, value.var="V1", drop=c(TRUE, FALSE))

# Prepend Color_ in front of each one-hot-encoded feature
setnames(color.binarized, setdiff(colnames(color.binarized), "ID"), paste0("Color_", setdiff(colnames(color.binarized), "ID")))

# one hot encode the shapes
shape.binarized <- dcast(dt[, list(V1=1, ID, Shape)], ID ~ Shape, fun.aggregate=sum, value.var="V1", drop=c(TRUE, FALSE))

# Prepend Shape_ in front of each one-hot-encoded feature
setnames(shape.binarized, setdiff(colnames(shape.binarized), "ID"), paste0("Shape_", setdiff(colnames(shape.binarized), "ID")))

# Join one-hot tables with original dataset
dt <- dt[color.binarized, on="ID"]
dt <- dt[shape.binarized, on="ID"]

dt
   ID Color    Shape Color_blue Color_green Color_red Color_purple Shape_cirlce Shape_square Shape_triangle
1:  1 green   square          0           1         0            0            0            1              0
2:  2   red triangle          0           0         1            0            0            0              1
3:  3   red   square          0           0         1            0            0            1              0
4:  4  blue triangle          1           0         0            0            0            0              1
5:  5 green   cirlce          0           1         0            0            1            0              0

This is something I do a lot, and as you can see it's pretty tedious (especially when my data has many factor columns). Is there an easier way to do this with data.table? In particular, I assumed dcast would let me one-hot-encode multiple columns at once, but when I try something like

dcast(dt[, list(V1=1, ID, Color, Shape)], ID ~ Color + Shape, fun.aggregate=sum, value.var="V1", drop=c(TRUE, FALSE))

I get one column for every Color/Shape combination instead of one column per level:

   ID blue_cirlce blue_square blue_triangle green_cirlce green_square green_triangle red_cirlce red_square red_triangle purple_cirlce purple_square purple_triangle
1:  1           0           0             0            0            1              0          0          0            0             0             0               0
2:  2           0           0             0            0            0              0          0          0            1             0             0               0
3:  3           0           0             0            0            0              0          0          1            0             0             0               0
4:  4           0           0             1            0            0              0          0          0            0             0             0               0
5:  5           0           0             0            1            0              0          0          0            0             0             0               0
Asked by Ben, Oct 06 '16 21:10


1 Answer

Using model.matrix:

> cbind(dt[, .(ID)], model.matrix(~ Color + Shape, dt))
   ID (Intercept) Colorgreen Colorred Colorpurple Shapesquare Shapetriangle
1:  1           1          1        0           0           1             0
2:  2           1          0        1           0           0             1
3:  3           1          0        1           0           1             0
4:  4           1          0        0           0           0             1
5:  5           1          1        0           0           0             0

This makes the most sense if you're doing modelling.
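
If you are not modelling and just want the indicator columns appended to the table, one possible sketch is a small loop over the factor levels using set(). The helper name one_hot is hypothetical (it is not part of data.table), and the toy data is only modelled on the question's example. Unused levels such as "purple" are kept, and an NA in a factor would give NA in its indicator columns.

```r
library(data.table)

# Toy data modelled on the question's example
dt <- data.table(
  ID = 1:5,
  Color = factor(c("green", "red", "red", "blue", "green"),
                 levels = c("blue", "green", "red", "purple")),
  Shape = factor(c("square", "triangle", "square", "triangle", "circle"))
)

# one_hot() is a hypothetical helper, not a data.table function.
# By default it encodes every factor column of x.
one_hot <- function(x, cols = names(x)[vapply(x, is.factor, logical(1))]) {
  out <- copy(x)  # work on a copy so the caller's table is not modified
  for (col in cols) {
    for (lev in levels(x[[col]])) {
      # Add one 0/1 integer column per level, named <column>_<level>
      set(out, j = paste(col, lev, sep = "_"),
          value = as.integer(x[[col]] == lev))
    }
  }
  out[]
}

encoded <- one_hot(dt)
names(encoded)
```

Because it does not join on ID, this does not need a unique key column, and the rows stay in their original order.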

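To suppress the intercept in the model.matrix call, subtract 1 in the formula. This restores the aliased column for the first variable only; later factors such as Shape still drop their reference level: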

> cbind(dt[, .(ID)], model.matrix(~ Color + Shape - 1, dt))
   ID Colorblue Colorgreen Colorred Colorpurple Shapesquare Shapetriangle
1:  1         0          1        0           0           1             0
2:  2         0          0        1           0           0             1
3:  3         0          0        1           0           1             0
4:  4         1          0        0           0           0             1
5:  5         0          1        0           0           0             0
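
Note that Shape still has only two columns in the output above. If you need an indicator for every level of every factor, one option (a sketch, not part of the original answer) is to pass identity contrasts through the contrasts.arg argument of model.matrix. The variable names factor_cols and full_coding are made up for illustration, and the toy data is modelled on the question's example.

```r
library(data.table)

# Toy data modelled on the question's example
dt <- data.table(
  ID = 1:5,
  Color = factor(c("green", "red", "red", "blue", "green"),
                 levels = c("blue", "green", "red", "purple")),
  Shape = factor(c("square", "triangle", "square", "triangle", "circle"))
)

factor_cols <- c("Color", "Shape")

# contrasts(..., contrasts = FALSE) returns an identity matrix over the
# levels, i.e. a full indicator coding with no reference level dropped
full_coding <- setNames(
  lapply(factor_cols, function(col) contrasts(dt[[col]], contrasts = FALSE)),
  factor_cols
)

mm <- model.matrix(~ Color + Shape - 1, data = dt, contrasts.arg = full_coding)

# Attach the ID column to the indicator columns
encoded <- cbind(dt[, .(ID)], as.data.table(mm))
colnames(mm)
```

The resulting matrix is deliberately redundant (each factor's columns sum to 1 in every row), so it suits feature pipelines that expect full one-hot input better than an unregularised linear model.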
Answered by Hong Ooi, Oct 03 '22 02:10