Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Recode categorical factor with N categories into N binary columns

Original data frame:

v1 = sample(letters[1:3], 10, replace=TRUE)
v2 = sample(letters[1:3], 10, replace=TRUE)
df = data.frame(v1,v2)
df
   v1 v2
1   b  c
2   a  a
3   c  c
4   b  a
5   c  c
6   c  b
7   a  a
8   a  b
9   a  c
10  a  b

New data frame:

new_df = data.frame(row.names=rownames(df))
for (i in colnames(df)) {
    for (x in letters[1:3]) {
        #new_df[x] = as.numeric(df[i] == x)
        new_df[paste0(i, "_", x)] = as.numeric(df[i] == x)
    }
}
   v1_a v1_b v1_c v2_a v2_b v2_c
1     0    1    0    0    0    1
2     1    0    0    1    0    0
3     0    0    1    0    0    1
4     0    1    0    1    0    0
5     0    0    1    0    0    1
6     0    0    1    0    1    0
7     1    0    0    1    0    0
8     1    0    0    0    1    0
9     1    0    0    0    0    1
10    1    0    0    0    1    0

For small datasets this is fine, but it becomes slow for much larger datasets.

Anyone know of a way to do this without using looping?

like image 253
Keith Hughitt Avatar asked Apr 24 '13 19:04

Keith Hughitt


2 Answers

Even better with the help of @AnandaMahto's search capabilities,

model.matrix(~ . + 0, data=df, contrasts.arg = lapply(df, contrasts, contrasts=FALSE))
#    v1a v1b v1c v2a v2b v2c
# 1    0   1   0   0   0   1
# 2    1   0   0   1   0   0
# 3    0   0   1   0   0   1
# 4    0   1   0   1   0   0
# 5    0   0   1   0   0   1
# 6    0   0   1   0   1   0
# 7    1   0   0   1   0   0
# 8    1   0   0   0   1   0
# 9    1   0   0   0   0   1
# 10   1   0   0   0   1   0

I think this is what you're looking for. I'd be happy to delete if it's not so. Thanks to @G.Grothendieck (once again) for the excellent usage of model.matrix!

cbind(with(df, model.matrix(~ v1 + 0)), with(df, model.matrix(~ v2 + 0)))
#    v1a v1b v1c v2a v2b v2c
# 1    0   1   0   0   0   1
# 2    1   0   0   1   0   0
# 3    0   0   1   0   0   1
# 4    0   1   0   1   0   0
# 5    0   0   1   0   0   1
# 6    0   0   1   0   1   0
# 7    1   0   0   1   0   0
# 8    1   0   0   0   1   0
# 9    1   0   0   0   0   1
# 10   1   0   0   0   1   0

Note: Your output is just:

with(df, model.matrix(~ v2 + 0))

Note 2: This gives a matrix. Fairly obvious, but still, wrap it with as.data.frame(.) if you want a data.frame.

like image 86
Arun Avatar answered Sep 26 '22 00:09

Arun


There is a function in caret's package that does what you require, dummyVars. Here is the example of it's usage taken from the authors documentation: http://topepo.github.io/caret/preprocess.html

library(earth)
data(etitanic)

dummies <- caret::dummyVars(survived ~ ., data = etitanic)
head(predict(dummies, newdata = etitanic))

  pclass.1st pclass.2nd pclass.3rd sex.female sex.male     age sibsp parch
1          1          0          0          1        0 29.0000     0     0
2          1          0          0          0        1  0.9167     1     2
3          1          0          0          1        0  2.0000     1     2
4          1          0          0          0        1 30.0000     1     2
5          1          0          0          1        0 25.0000     1     2
6          1          0          0          0        1 48.0000     0     0

The model.matrix options could be useful in case you had sparse data and wanted to use Matrix::sparse.model.matrix

like image 30
marbel Avatar answered Sep 24 '22 00:09

marbel