
Dummy variables from a string variable


I would like to create dummy variables from this dataset:

DF <- structure(list(A = c(1, 2, 3, 4, 5),
                     B = c("1,3,2", "2,1,3,6", "3,2,5,1,7",
                           "3,7,4,2,6,5", "4,10,7,3,5,6")),
                .Names = c("A", "B"),
                row.names = c(NA, 5L), class = "data.frame")

> DF
  A            B
1 1        1,3,2
2 2      2,1,3,6
3 3    3,2,5,1,7
4 4  3,7,4,2,6,5
5 5 4,10,7,3,5,6

The desired output should look like this:

A  1  2  3  4  5  6  7  8  9  10
1  1  1  1  0  0  0  0  0  0  0
2  1  1  1  0  0  1  0  0  0  0
3  1  1  1  0  1  0  1  0  0  0
4  0  1  1  1  1  1  1  0  0  0
5  0  0  1  1  1  1  1  0  0  1

Is there an efficient way to do this? I can use strsplit or ifelse. The original dataset is very large, with many rows (>10k) and values in column B (>15k). The dummy function from the dummies package doesn't work the way I want it to.

I also found a similar case: Splitting one column into multiple columns. But the answers from that question run really slowly in my case (up to 15 minutes on my Dell i7-2630QM, 8 GB, Win7 64-bit, R 2.15.3 64-bit).

Thank you in advance for your answers.

Maciej asked Apr 28 '13 20:04





1 Answer

UPDATE

The function mentioned here has now been moved to a package available on CRAN called "splitstackshape". The version on CRAN is considerably faster than this original version. The speeds should be similar to what you would get with the direct for loop solution at the end of this answer. See @Ricardo's answer for detailed benchmarks.

Install it, and use concat.split.expanded to get the desired result:

library(splitstackshape)
concat.split.expanded(DF, "B", fill = 0, drop = TRUE)
#   A B_01 B_02 B_03 B_04 B_05 B_06 B_07 B_08 B_09 B_10
# 1 1    1    1    1    0    0    0    0    0    0    0
# 2 2    1    1    1    0    0    1    0    0    0    0
# 3 3    1    1    1    0    1    0    1    0    0    0
# 4 4    0    1    1    1    1    1    1    0    0    0
# 5 5    0    0    1    1    1    1    1    0    0    1

Original post

A while ago, I wrote a function that handles not just this sort of splitting, but others as well. The function, named concat.split(), can be found here.

The usage, for your example data, would be:

## Keeping the original column
concat.split(DF, "B", structure="expanded")
#   A            B B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1        1,3,2   1   1   1  NA  NA  NA  NA  NA  NA   NA
# 2 2      2,1,3,6   1   1   1  NA  NA   1  NA  NA  NA   NA
# 3 3    3,2,5,1,7   1   1   1  NA   1  NA   1  NA  NA   NA
# 4 4  3,7,4,2,6,5  NA   1   1   1   1   1   1  NA  NA   NA
# 5 5 4,10,7,3,5,6  NA  NA   1   1   1   1   1  NA  NA    1

## Dropping the original column
concat.split(DF, "B", structure="expanded", drop.col=TRUE)
#   A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1   1   1   1  NA  NA  NA  NA  NA  NA   NA
# 2 2   1   1   1  NA  NA   1  NA  NA  NA   NA
# 3 3   1   1   1  NA   1  NA   1  NA  NA   NA
# 4 4  NA   1   1   1   1   1   1  NA  NA   NA
# 5 5  NA  NA   1   1   1   1   1  NA  NA    1

Recoding NA to 0 has to be done manually--perhaps I'll update the function to add an option to do so, and at the same time, implement one of these faster solutions :)

temp <- concat.split(DF, "B", structure="expanded", drop.col=TRUE)
temp[is.na(temp)] <- 0
temp
#   A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1   1   1   1   0   0   0   0   0   0    0
# 2 2   1   1   1   0   0   1   0   0   0    0
# 3 3   1   1   1   0   1   0   1   0   0    0
# 4 4   0   1   1   1   1   1   1   0   0    0
# 5 5   0   0   1   1   1   1   1   0   0    1

Update

Most of the overhead in the concat.split function probably comes in things like converting from a matrix to a data.frame, renaming the columns, and so on. The actual code used to do the splitting is a GASP for loop, but test it out, and you'll find that it performs pretty well:

b = strsplit(DF$B, ",")
ncol = max(as.numeric(unlist(b)))
temp = lapply(b, as.numeric)

## Set up an empty matrix
m = matrix(0, nrow = nrow(DF), ncol = ncol)

## Fill it in
for (i in 1:nrow(DF)) {
  m[i, temp[[i]]] = 1
}

## View your result
m
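
As a variation on the loop above (a sketch, not part of concat.split or splitstackshape), the same matrix can be filled in a single step with base R's two-column matrix indexing. It assumes every value in B is a positive integer that can serve as a column position, and that lengths() is available (R 3.2.0 or later). The names parts, rows, cols and result, and the "B_" column prefix, are illustrative choices.

```r
# Example data from the question
DF <- data.frame(
  A = c(1, 2, 3, 4, 5),
  B = c("1,3,2", "2,1,3,6", "3,2,5,1,7", "3,7,4,2,6,5", "4,10,7,3,5,6"),
  stringsAsFactors = FALSE
)

# Split each string into integer column positions
parts <- lapply(strsplit(DF$B, ",", fixed = TRUE), as.integer)

# One row index per value, paired with that value's column index
rows <- rep(seq_along(parts), lengths(parts))
cols <- unlist(parts)

# Fill the indicator matrix in one assignment: each row of
# cbind(rows, cols) addresses a single (row, column) cell
m <- matrix(0L, nrow = nrow(DF), ncol = max(cols))
m[cbind(rows, cols)] <- 1L

# Label the dummy columns and attach the identifier column
# (the "B_" prefix is an assumed naming choice)
colnames(m) <- paste0("B_", seq_len(ncol(m)))
result <- cbind(DF["A"], m)
result
```

Whether this beats the loop on the full dataset would need to be timed on the real data; with more than 15k possible values the dense matrix itself may become the bottleneck.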
A5C1D2H2I1M1N2O1R2T1 answered Sep 18 '22 06:09
