
One-hot encoding a text string [duplicate]

Tags: r, data.table

I have a column that contains strings of mixed characters, and I've created one column for each unique character that can appear in those strings. I need to fill each of these columns with a 1 if its character appears in the row's string and a 0 otherwise.

library(data.table)
d = data.table(string = c("P_P_F_", "U_F_/", "-_P_B"),
               P = c(1,  0, 1),
               F = c(1, 1, 0),
               U = c(0, 1, 0),
               B = c(0, 0, 1))

In the example above, string holds the characters I need to match against the corresponding columns. The first string contains a P and an F, so those columns get a 1 and the rest get a 0. The characters within a string are always separated by an underscore, and a string has a maximum length of 7.

The data set is quite large, so I would prefer a data.table solution if possible.

MidnightDataGeek asked Mar 01 '23 16:03



1 Answer

We can use mtabulate from the qdapTools package after splitting the string on its separator characters

library(qdapTools)
cbind(d, mtabulate(strsplit(d$string, "[_/-]")))

data

d <- data.table(string = c("P_P_F_", "U_F_/", "-_P_B"))
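
If strict 0/1 flags are needed (as far as I can tell, mtabulate returns counts, so a repeated character such as the two P's in "P_P_F_" would show up as 2), a data.table-only alternative is sketched below. It is not part of the original answer. It assumes the set of characters to encode is known up front (chars here is an illustrative name, filled with the column names from the question) and that underscore is the only separator, so "/" and "-" are kept as tokens that simply match no column.

```r
library(data.table)

d <- data.table(string = c("P_P_F_", "U_F_/", "-_P_B"))

# Assumption: the characters to encode are known in advance
# (taken here from the columns shown in the question).
chars <- c("P", "F", "U", "B")

# Split each string on literal underscores; trailing empty pieces are dropped.
tokens <- strsplit(d$string, "_", fixed = TRUE)

# For every target character, add a column by reference that holds
# 1L where the row's tokens contain that character and 0L otherwise.
d[, (chars) := lapply(chars, function(ch) {
  as.integer(vapply(tokens, function(tk) ch %in% tk, logical(1)))
})]

print(d)
```

The columns are added by reference with :=, which avoids copying the table and should suit a large data set. Matching whole tokens with %in% rather than searching the raw string also keeps this correct if a token is ever longer than one character.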
akrun answered Mar 09 '23 01:03
