Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently splitting a character vector

Tags:

list

split

r

I have this style of a character vector:

vec <- c("id a; sex m; age 16; type 1;","id a; sex m; age 16;","id a; sex m; age 16; type 3")

Every element in vec is a "; " separated list of attributes where the each attribute has the "key value" format (the "; " character can only appear as the separator).

So the first list of attributes is: id=a sex=m age=16 type=1

Note that different elements in vec may have slightly different attributes.

I'm looking for an efficient way to split vec to a list of lists. Each element in the outer list is a list of all attribute values where the element names are the attribute keys. This means that the length of the outer list will be the length of the elements of vec and the length of each inner list will the the length of attributes.

I currently have this implementation, which help to understand the output I need:

attributes.list <- sapply(vec, function(x) strsplit(x, split = "(\\;)(\\s+)?", perl = TRUE)[[1]])
attributes.lol <- lapply(attributes.list, function(x) {
  attribute.mat <- sapply(x, function(y) strsplit(y, split = " ")[[1]])
  colnames(attribute.mat) <- NULL
  attribute.list <- as.list(attribute.mat[2,])
  names(attribute.list) <- attribute.mat[1,]
  return(attribute.list)
})

> attributes.lol[[1]]
$id
[1] "a"

$sex
[1] "m"

$age
[1] "16"

$type
[1] "1"

The length of vec in reality is very long (~million elements) so I was wondering if there's a more efficient way to achieve this.

like image 376
user1701545 Avatar asked Dec 24 '22 10:12

user1701545


1 Answers

I would suggest a combination of "iotools" and "data.table", something along the lines of this:

library(iotools)
library(data.table)
melt(data.table(ind = seq_along(vec), trimws(mstrsplit(vec, ";"))),
     "ind", na.rm = TRUE)[
      , c("key", "val") := tstrsplit(value, " ", TRUE)][
        , c("variable", "value") := NULL][]

Or, if you want a "wide" form (like @GGrothendieck's answer):

dcast(
  melt(data.table(ind = seq_along(vec), trimws(mstrsplit(vec, ";"))),
       "ind", na.rm = TRUE)[
         , c("key", "val") := tstrsplit(value, " ", TRUE)][
           , c("variable", "value") := NULL][], ind ~ key, value.var = "val")

I suggest the above because you mention that you want an efficient approach. Compare the following:

Sample data length 3, approximately 100000, and approximately 1 million.

vec <- c("id a; sex m; age 16; type 1;","id a; sex m; age 16;","id a; sex m; age 16; type 3")
v100k <- rep(vec, ceiling(100000/length(vec)))
v1M <- rep(vec, ceiling(1000000/length(vec)))

The approaches we want to test:

library(iotools)
library(data.table)

funAM_l <- function(invec) {
  melt(data.table(ind = seq_along(invec), trimws(mstrsplit(invec, ";"))), "ind", na.rm = TRUE)[
    , c("key", "val") := tstrsplit(value, " ", TRUE)][
      , c("variable", "value") := NULL][]
}

funAM_w <- function(invec) dcast(funAM_l(invec), ind ~ key, value.var = "val")

funMT <- function(v) {
  z <- strsplit(v, split = "(\\;)(\\s+)?", perl = TRUE)
  lapply(z,function(s) {v <- unlist(strsplit(s,' ')); setNames(as.list(v[c(F,T)]),v[c(T,F)]) })
}

funF <- function(invec) rbindlist(lapply(invec, function(x) { fread(gsub(";", "\n", x)) }), idcol = TRUE)

funGG <- function(invec) read.dcf(textConnection(sub(" ",": ",trimws(unlist(strsplit(paste0(invec, ";"),";"))))))

My suggestion isn't going to win any races with a small vector:

library(microbenchmark)
microbenchmark(funAM_l(vec), funAM_w(vec), funF(vec), funGG(vec), funMT(vec))
# Unit: microseconds
#          expr      min        lq       mean    median        uq      max neval
#  funAM_l(vec) 1474.163 1525.3765 1614.28414 1573.6325 1601.3815 2828.481   100
#  funAM_w(vec) 3293.376 3482.9510 3741.30381 3553.7240 3714.1730 6787.863   100
#     funF(vec)  690.761  729.4900  830.61645  756.4610  777.6725 4083.904   100
#    funGG(vec)  182.281  209.8405  220.46376  220.8055  232.1820  280.788   100
#    funMT(vec)   57.288   76.5225   84.81496   83.2755   90.3120  166.352   100

But look at what happens when we scale up the vectors:

system.time(funAM_l(v100k))
#    user  system elapsed 
#    0.24    0.00    0.24 
system.time(funAM_w(v100k))
#    user  system elapsed 
#   0.296   0.000   0.296 
system.time(funMT(v100k))
#    user  system elapsed 
#   1.768   0.000   1.768 
system.time(funF(v100k))
#    user  system elapsed 
#  21.960   0.136  22.068 
system.time(funGG(v100k))
#    user  system elapsed 
#  30.968   0.004  30.940 

Here's how it performs on a vector of length 1 million.

system.time(funAM_w(v1M))
#    user  system elapsed 
#   4.316   0.092   4.402 

My other suggestion was going to be to look at cSplit from my "splitstackshape" package. That is a little bit better performing than @Marat's approach.

Here it is on 1 million values:

library(splitstackshape)
system.time(dcast(
  cSplit(cSplit(data.table(ind = seq_along(v1M), v1M), "v1M", ";", "long"), "v1M", " "), 
  ind ~ v1M_1, value.var = "v1M_2"))
#    user  system elapsed 
#  13.744   0.156  13.882
like image 191
A5C1D2H2I1M1N2O1R2T1 Avatar answered Jan 10 '23 23:01

A5C1D2H2I1M1N2O1R2T1