Split different lengths values and bind to columns

Tags:

I've got a rather large (around 100k observations) data set, similar to this:

data <- data.frame(
                 ID = seq(1, 5, 1),
                 Values = c("1,2,3", "4", " ", "4,1,6,5,1,1,6", "0,0"), 
                 stringsAsFactors=F)
data
  ID        Values
1  1         1,2,3
2  2             4
3  3              
4  4 4,1,6,5,1,1,6
5  5           0,0

I want to split the Values column by "," with NA for missed cells:

ID v1 v2 v3 v4 v5 v6 v7
1  1  2  3  NA NA NA NA
2  4  NA NA NA NA NA NA
3  NA NA NA NA NA NA NA
4  4  1  6  5  1  1  6
5  0  0  NA NA NA NA NA
...

Best attempt was strsplit + rbind:

df <- data.frame(do.call(
                        "rbind",
                        strsplit(as.character(data$Values), split = "," , fixed = FALSE)
                        ))

But rbind function just recycles all 'short' rows instead to set an "NA". Have found similar problem

Many thanks, Leo

898

asked Aug 11 '14 13:08

Leo

2 Answers

I would suggest looking at my cSplit function or approaching the problem manually.

The cSplit approach would simply be:

cSplit(data, "Values", ",")
#    ID Values_1 Values_2 Values_3 Values_4 Values_5 Values_6 Values_7
# 1:  1        1        2        3       NA       NA       NA       NA
# 2:  2        4       NA       NA       NA       NA       NA       NA
# 3:  3                NA       NA       NA       NA       NA       NA
# 4:  4        4        1        6        5        1        1        6
# 5:  5        0        0       NA       NA       NA       NA       NA

Approaching the problem manually would look like:

## Split up the values
Split <- strsplit(data$Values, ",", fixed = TRUE)
## How long is each list element?
Ncol <- vapply(Split, length, 1L)
## Create an empty character matrix to store the results
M <- matrix(NA_character_, nrow = nrow(data),
            ncol = max(Ncol), 
            dimnames = list(NULL, paste0("V", sequence(max(Ncol)))))
## Use matrix indexing to figure out where to put the results
M[cbind(rep(1:nrow(data), Ncol), 
        sequence(Ncol))] <- unlist(Split, use.names = FALSE)
## Bind the values back together, here as a "data.table" (faster)
data.table(ID = data$ID, M)

^^ That's pretty much what goes on in cSplit, but the function has a few other options and some basic error checking and so on that might make it a little bit slower than a purely manual approach (or a function written to address your specific problem).

Both of these approaches would be faster than a "data.table" + "reshape2" approach. Also, since each row is treated individually, you shouldn't have any problems even if you have duplicated ID values--your output should have the same number of rows as your input.

Benchmarks

I've done benchmarks on more rows and on data that would give "wider" results (since that's implied in your comments to David's answer).

Here is the sample data:

set.seed(1)
a <- sample(0:100, 100000, TRUE)
Values <- vapply(a, function(x) 
  paste(sample(0:100, x, TRUE), collapse = ","), character(1L))
Values[sample(length(Values), length(Values) * .15)] <- ""
ID <- c(1:80000, 1:20000)
data <- data.frame(ID, Values, stringsAsFactors = FALSE)
DT <- as.data.table(data)

Here are the functions to test:

fun1a <- function(inDT) {
  data2 <- DT[, list(Values = unlist(
    strsplit(Values, ","))), by = ID]
  data2[, Var := paste0("v", seq_len(.N)), by = ID] 
  dcast.data.table(data2, ID ~ Var, 
                   fill = NA_character_, 
                   value.var = "Values")
}

fun1b <- function(inDT) {
  data2 <- DT[, list(Values = unlist(
    strsplit(Values, ",", fixed = TRUE), 
    use.names = FALSE)), by = ID]
  data2[, Var := paste0("v", seq_len(.N)), by = ID] 
  dcast.data.table(data2, ID ~ Var, 
                   fill = NA_character_, 
                   value.var = "Values")
}

fun2 <- function(inDT) {
  cSplit(DT, "Values", ",")
}

fun3 <- function(inDF) {
  Split <- strsplit(inDF$Values, ",", fixed = TRUE)
  Ncol <- vapply(Split, length, 1L)
  M <- matrix(NA_character_, nrow = nrow(inDF),
              ncol = max(Ncol), 
              dimnames = list(NULL, paste0("V", sequence(max(Ncol)))))
  M[cbind(rep(1:nrow(inDF), Ncol), 
          sequence(Ncol))] <- unlist(Split, use.names = FALSE)
  data.table(ID = inDF$ID, M)
}

Here are the results:

library(microbenchmark)
microbenchmark(fun2(DT), fun3(data), times = 20)
# Unit: seconds
#        expr      min       lq   median       uq      max neval
#    fun2(DT) 4.810942 5.173103 5.498279 5.622279 6.003339    20
#  fun3(data) 3.847228 3.929311 4.058728 4.160082 4.664568    20

## Didn't want to microbenchmark here...
system.time(fun1a(DT))
#    user  system elapsed 
#   16.92    0.50   17.59
system.time(fun1b(DT))  # fixed = TRUE & use.names = FALSE
#    user  system elapsed 
#   11.54    0.42   12.01

NOTE: The results of fun1a and fun1b would not be the same as those of fun2 and fun3 because of the duplicated IDs.

178

answered Nov 04 '22 00:11

A5C1D2H2I1M1N2O1R2T1

Here's a data.table combined with reshape2 approach (should be very efficient)

library(data.table) # Loading `data.table` package
data2 <- setDT(data)[, list(Values = unlist(strsplit(Values, ","))), by = ID] # splitting the values by `,` for each `ID`
data2[, Var := paste0("v", seq_len(.N)), by = ID] # Adding the `Var` variable

library(reshape2) # Loading `reshape2` package
dcast.data.table(data2, ID ~ Var, fill = NA_character_, value.var = "Values") # decasting

#    ID v1 v2 v3 v4 v5 v6 v7
# 1:  1  1  2  3 NA NA NA NA
# 2:  2  4 NA NA NA NA NA NA
# 3:  3    NA NA NA NA NA NA
# 4:  4  4  1  6  5  1  1  6
# 5:  5  0  0 NA NA NA NA NA

answered Nov 03 '22 23:11

David Arenburg

Related questions
                            
                                How to fork processes in R
                            
                                How to save a VERY LARGE .rda file in R package
                            
                                What's happening when changing "class" of an S4 object using the class function?
                            
                                Backreferencing in R (Regular Expressions)
                            
                                Test if two `data.table`s point to the same memory location [duplicate]
                            
                                Computing pairwise distances between a set of intervals
                            
                                na.locf and inverse.rle in Rcpp
                            
                                Shiny rcharts size of the graph?
                            
                                Importing Sea Surface Temperature text files in ASCII format into R
                            
                                Problems in Knitting Html in preview release of RStudio
                            
                                Plotting group distances in R
                            
                                Load data object when package is loaded
                            
                                How can one set the size of an igraph plot?
                            
                                Equivalent of "xlim" or "ylim" for vertical/horizontal abline?
                            
                                Legend on survival plot
                            
                                Heatmap Transparency, Coloring and Specificity not Satisfying
                            
                                Read data from memory in Vowpal Wabbit?
                            
                                Shiny + ggplot: How to subset reactive data object?
                            
                                Long lag time importing large .CSV's in R WITH header in second row
                            
                                data.table's fread() giving unwanted download message

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Split different lengths values and bind to columns

Tags:

r

data.table

rbind

strsplit

Leo

People also ask

2 Answers

Benchmarks

A5C1D2H2I1M1N2O1R2T1

David Arenburg

Recent Activity

Donate For Us