I would like to create dummy variables from this dataset:
    DF <- structure(list(A = c(1, 2, 3, 4, 5),
                         B = c("1,3,2", "2,1,3,6", "3,2,5,1,7",
                               "3,7,4,2,6,5", "4,10,7,3,5,6")),
                    .Names = c("A", "B"), row.names = c(NA, 5L),
                    class = "data.frame")
    > DF
      A            B
    1 1        1,3,2
    2 2      2,1,3,6
    3 3    3,2,5,1,7
    4 4  3,7,4,2,6,5
    5 5 4,10,7,3,5,6
The desired output should look like this:
    A 1 2 3 4 5 6 7 8 9 10
    1 1 1 1 0 0 0 0 0 0  0
    2 1 1 1 0 0 1 0 0 0  0
    3 1 1 1 0 1 0 1 0 0  0
    4 0 1 1 1 1 1 1 0 0  0
    5 0 0 1 1 1 1 1 0 0  1
Is there an efficient way to do this? I can use strsplit or ifelse. The original dataset is very large, with many rows (>10k) and values in column B (>15k). The function dummy from the package dummies doesn't work the way I want it to.
I also found a similar case: Splitting one column into multiple columns. But the answers to that question run really slowly in my case (up to 15 minutes on my Dell i7-2630QM, 8 GB, Win7 64-bit, R 2.15.3 64-bit).
Thank you in advance for your answers.
The function mentioned here has now been moved to a package available on CRAN called "splitstackshape". The version on CRAN is considerably faster than this original version. The speeds should be similar to what you would get with the direct for loop solution at the end of this answer. See @Ricardo's answer for detailed benchmarks.
Install it, and use concat.split.expanded to get the desired result:
    library(splitstackshape)
    concat.split.expanded(DF, "B", fill = 0, drop = TRUE)
    #   A B_01 B_02 B_03 B_04 B_05 B_06 B_07 B_08 B_09 B_10
    # 1 1    1    1    1    0    0    0    0    0    0    0
    # 2 2    1    1    1    0    0    1    0    0    0    0
    # 3 3    1    1    1    0    1    0    1    0    0    0
    # 4 4    0    1    1    1    1    1    1    0    0    0
    # 5 5    0    0    1    1    1    1    1    0    0    1
Original post
A while ago, I wrote a function that does not just this sort of splitting, but others too. The function, named concat.split(), can be found here.
The usage, for your example data, would be:
    ## Keeping the original column
    concat.split(DF, "B", structure="expanded")
    #   A            B B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
    # 1 1        1,3,2   1   1   1  NA  NA  NA  NA  NA  NA   NA
    # 2 2      2,1,3,6   1   1   1  NA  NA   1  NA  NA  NA   NA
    # 3 3    3,2,5,1,7   1   1   1  NA   1  NA   1  NA  NA   NA
    # 4 4  3,7,4,2,6,5  NA   1   1   1   1   1   1  NA  NA   NA
    # 5 5 4,10,7,3,5,6  NA  NA   1   1   1   1   1  NA  NA    1

    ## Dropping the original column
    concat.split(DF, "B", structure="expanded", drop.col=TRUE)
    #   A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
    # 1 1   1   1   1  NA  NA  NA  NA  NA  NA   NA
    # 2 2   1   1   1  NA  NA   1  NA  NA  NA   NA
    # 3 3   1   1   1  NA   1  NA   1  NA  NA   NA
    # 4 4  NA   1   1   1   1   1   1  NA  NA   NA
    # 5 5  NA  NA   1   1   1   1   1  NA  NA    1
Recoding NA to 0 has to be done manually. Perhaps I'll update the function to add an option to do so and, at the same time, implement one of these faster solutions :)
    temp <- concat.split(DF, "B", structure="expanded", drop.col=TRUE)
    temp[is.na(temp)] <- 0
    temp
    #   A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
    # 1 1   1   1   1   0   0   0   0   0   0    0
    # 2 2   1   1   1   0   0   1   0   0   0    0
    # 3 3   1   1   1   0   1   0   1   0   0    0
    # 4 4   0   1   1   1   1   1   1   0   0    0
    # 5 5   0   0   1   1   1   1   1   0   0    1
Most of the overhead in the concat.split function probably comes from things like converting from a matrix to a data.frame, renaming the columns, and so on. The actual code used to do the splitting is a (gasp!) for loop, but test it out, and you'll find that it performs pretty well:
    b = strsplit(DF$B, ",")
    ncol = max(as.numeric(unlist(b)))
    temp = lapply(b, as.numeric)
    ## Set up an empty matrix
    m = matrix(0, nrow = nrow(DF), ncol = ncol)
    ## Fill it in
    for (i in 1:nrow(DF)) {
      m[i, temp[[i]]] = 1
    }
    ## View your result
    m
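The same fill can also be written without an explicit loop by indexing the matrix with a two-column matrix of (row, column) pairs. This is only a sketch of that idea in base R, not code from splitstackshape; the names vals, rows, cols, and result are chosen here for illustration, and it assumes every value in B is a positive integer.

```r
# Example data from the question
DF <- data.frame(
  A = c(1, 2, 3, 4, 5),
  B = c("1,3,2", "2,1,3,6", "3,2,5,1,7", "3,7,4,2,6,5", "4,10,7,3,5,6"),
  stringsAsFactors = FALSE
)

# Split each string into its integer values (assumes positive integers only)
vals <- lapply(strsplit(DF$B, ",", fixed = TRUE), as.integer)

# Build one (row, column) pair for every value found in B
rows <- rep(seq_along(vals), vapply(vals, length, integer(1)))
cols <- unlist(vals)

# Set up an empty matrix and fill it in a single assignment
m <- matrix(0L, nrow = nrow(DF), ncol = max(cols))
m[cbind(rows, cols)] <- 1L

# Optional: label the indicator columns and attach column A
colnames(m) <- seq_len(ncol(m))
result <- cbind(DF["A"], m)
result
```

Whether this beats the for loop on the full dataset is something to measure rather than assume. Note also that a dense matrix with more than 10k rows and 15k columns is large, so memory use may matter more than the fill step itself.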