Dummy variables from a string variable

I would like to create dummy variables from this dataset:

    DF <- structure(list(A = c(1, 2, 3, 4, 5),
                         B = c("1,3,2", "2,1,3,6", "3,2,5,1,7",
                               "3,7,4,2,6,5", "4,10,7,3,5,6")),
                    .Names = c("A", "B"),
                    row.names = c(NA, 5L), class = "data.frame")
    > DF
      A            B
    1 1        1,3,2
    2 2      2,1,3,6
    3 3    3,2,5,1,7
    4 4  3,7,4,2,6,5
    5 5 4,10,7,3,5,6

The desired output should look like this:

    A  1  2  3  4  5  6  7  8  9  10
    1  1  1  1  0  0  0  0  0  0  0
    2  1  1  1  0  0  1  0  0  0  0
    3  1  1  1  0  1  0  1  0  0  0
    4  0  1  1  1  1  1  1  0  0  0
    5  0  0  1  1  1  1  1  0  0  1

Is there an efficient way to do this? I can use strsplit or ifelse. The original dataset is very large, with many rows (>10k) and many distinct values in column B (>15k). The dummy function from the dummies package doesn't do what I want.

I also found a similar case: Splitting one column into multiple columns. But the answers there run really slowly in my case (up to 15 minutes on my Dell i7-2630QM, 8 GB, Win7 64-bit, R 2.15.3 64-bit).

Thank you in advance for your answers.

asked Apr 28 '13 20:04

Maciej

1 Answer

UPDATE

The function mentioned here has now been moved to a package available on CRAN called "splitstackshape". The version on CRAN is considerably faster than this original version. The speeds should be similar to what you would get with the direct for loop solution at the end of this answer. See @Ricardo's answer for detailed benchmarks.

Install it, and use concat.split.expanded to get the desired result:

    library(splitstackshape)
    concat.split.expanded(DF, "B", fill = 0, drop = TRUE)
    #   A B_01 B_02 B_03 B_04 B_05 B_06 B_07 B_08 B_09 B_10
    # 1 1    1    1    1    0    0    0    0    0    0    0
    # 2 2    1    1    1    0    0    1    0    0    0    0
    # 3 3    1    1    1    0    1    0    1    0    0    0
    # 4 4    0    1    1    1    1    1    1    0    0    0
    # 5 5    0    0    1    1    1    1    1    0    0    1

Original post

A while ago, I had written a function to do not just this sort of splitting, but others. The function, named concat.split(), can be found here.

The usage, for your example data, would be:

    ## Keeping the original column
    concat.split(DF, "B", structure="expanded")
    #   A            B B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
    # 1 1        1,3,2   1   1   1  NA  NA  NA  NA  NA  NA   NA
    # 2 2      2,1,3,6   1   1   1  NA  NA   1  NA  NA  NA   NA
    # 3 3    3,2,5,1,7   1   1   1  NA   1  NA   1  NA  NA   NA
    # 4 4  3,7,4,2,6,5  NA   1   1   1   1   1   1  NA  NA   NA
    # 5 5 4,10,7,3,5,6  NA  NA   1   1   1   1   1  NA  NA    1

    ## Dropping the original column
    concat.split(DF, "B", structure="expanded", drop.col=TRUE)
    #   A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
    # 1 1   1   1   1  NA  NA  NA  NA  NA  NA   NA
    # 2 2   1   1   1  NA  NA   1  NA  NA  NA   NA
    # 3 3   1   1   1  NA   1  NA   1  NA  NA   NA
    # 4 4  NA   1   1   1   1   1   1  NA  NA   NA
    # 5 5  NA  NA   1   1   1   1   1  NA  NA    1

Recoding NA to 0 has to be done manually--perhaps I'll update the function to add an option to do so, and at the same time, implement one of these faster solutions :)

    temp <- concat.split(DF, "B", structure="expanded", drop.col=TRUE)
    temp[is.na(temp)] <- 0
    temp
    #   A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
    # 1 1   1   1   1   0   0   0   0   0   0    0
    # 2 2   1   1   1   0   0   1   0   0   0    0
    # 3 3   1   1   1   0   1   0   1   0   0    0
    # 4 4   0   1   1   1   1   1   1   0   0    0
    # 5 5   0   0   1   1   1   1   1   0   0    1

Update

Most of the overhead in the concat.split function probably comes in things like converting from a matrix to a data.frame, renaming the columns, and so on. The actual code used to do the splitting is a GASP for loop, but test it out, and you'll find that it performs pretty well:

    b = strsplit(DF$B, ",")
    ncol = max(as.numeric(unlist(b)))
    temp = lapply(b, as.numeric)
    ## Set up an empty matrix
    m = matrix(0, nrow = nrow(DF), ncol = ncol)
    ## Fill it in
    for (i in 1:nrow(DF)) {
      m[i, temp[[i]]] = 1
    }
    ## View your result
    m
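If you would rather avoid the explicit loop, the same fill can probably be done in one assignment by indexing the matrix with a two-column (row, column) matrix. This is only a sketch in base R, not what concat.split does internally. It assumes every value in B is a positive integer and that lengths() is available (R 3.2.0 or later). The names vals, rows, cols, and result, and the "B_" column prefix, are just illustrative.

```r
# Example data from the question
DF <- data.frame(
  A = c(1, 2, 3, 4, 5),
  B = c("1,3,2", "2,1,3,6", "3,2,5,1,7", "3,7,4,2,6,5", "4,10,7,3,5,6"),
  stringsAsFactors = FALSE
)

# Split each string into integer column positions
# (assumption: all values are positive integers)
vals <- lapply(strsplit(DF$B, ",", fixed = TRUE), as.integer)

# Build one (row, column) pair per value
rows <- rep(seq_along(vals), lengths(vals))
cols <- unlist(vals)

# Start from an all-zero matrix and set every listed cell in one assignment
m <- matrix(0L, nrow = nrow(DF), ncol = max(cols))
m[cbind(rows, cols)] <- 1L

# Attach the id column and label the dummy columns (illustrative names)
colnames(m) <- paste0("B_", seq_len(ncol(m)))
result <- cbind(DF["A"], m)
result
```

Note that a dense matrix of the size described in the question (more than 10k rows by more than 15k columns) takes a lot of memory, so it is worth checking that it fits before converting it to a data.frame.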

answered Sep 18 '22 06:09

A5C1D2H2I1M1N2O1R2T1
