Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

cSplit library(splitstackshape) is always dropping the column

I was searching for a way to split the column content by a separator and converting a table into a long format. I found cSplit from the splitstackshape package and it is almost doing what I was looking for.

Problem is now with the drop option. I expected my split column to be copied in a way, but this does not happen. Am I doing it wrong? Somebody experienced the problem?

I am not sure if I do something wrong, but the drop = FALSE option is not working in my case.

Here is an example:

library(splitstackshape)
jnk <- data.table(a = '1,2,3,4,5', b = 5)
jnk
#            a b
# 1: 1,2,3,4,5 5

cSplit(jnk, 'a', ',', 'long', drop = FALSE)
#    a b
# 1: 1 5
# 2: 2 5
# 3: 3 5
# 4: 4 5
# 5: 5 5

What I expected was something like this:

cSplit(jnk, 'a', ',', 'long', drop = FALSE)
#    a b    a.orig
# 1: 1 5 1,2,3,4,5
# 2: 2 5 1,2,3,4,5
# 3: 3 5 1,2,3,4,5
# 4: 4 5 1,2,3,4,5
# 5: 5 5 1,2,3,4,5

I am using version 1.4.2

like image 467
drmariod Avatar asked May 13 '15 06:05

drmariod


1 Answers

The "long" format modifies the column in place using list(unlist(...)) within "data.table", assigned with :=. Hence, if drop were used, you would be splitting the column and then removing it!

I'll try to make it explicit in the documentation that drop is only for the wide format, or add a message if a user attempts to use drop in the long format. Feel free to file a FR or submit a PR.

The workaround would be to assign another column (say, "a_orig") and then do the splitting:

jnk <- data.table(a=c('1,2,3,4,5','1,2,3','2,3'),b=c(5,4,3))
cSplit(jnk[, a_orig := a], "a", ",", "long")
#     a b    a_orig
#  1: 1 5 1,2,3,4,5
#  2: 2 5 1,2,3,4,5
#  3: 3 5 1,2,3,4,5
#  4: 4 5 1,2,3,4,5
#  5: 5 5 1,2,3,4,5
#  6: 1 4     1,2,3
#  7: 2 4     1,2,3
#  8: 3 4     1,2,3
#  9: 2 3       2,3
# 10: 3 3       2,3

I haven't tested extensively, but a possible fix could be:

cSplit2 <- function(indt, splitCols, sep = ",", direction = "wide", 
                    fixed = TRUE, drop = TRUE, stripWhite = TRUE, 
                    makeEqual = NULL, type.convert = TRUE) {
  if (direction == "long" & !drop) {
    indt <- as.data.table(indt)
    indt[, `:=`(eval(paste(splitCols, "orig", sep = "_")),
                lapply(splitCols, function(x) indt[[x]]))]
  }
  cSplit(indt, splitCols, sep, direction, fixed, drop, stripWhite, 
         makeEqual, type.convert)
}

The basic idea is to only change the input dataset if direction == "wide" and drop = FALSE. This is similar to the idea that you had, but can possibly be the solution integrated into the actual package, somewhere around line 94. In this case, only the indt[, `:=`(eval(paste(splitCols, "orig", sep = "_")), lapply(splitCols, function(x) indt[[x]]))] part should be necessary.

like image 64
A5C1D2H2I1M1N2O1R2T1 Avatar answered Oct 21 '22 09:10

A5C1D2H2I1M1N2O1R2T1