I have this style of a character vector:
vec <- c("id a; sex m; age 16; type 1;","id a; sex m; age 16;","id a; sex m; age 16; type 3")
Every element in vec
is a "; " separated list of attributes where the each attribute has the "key value" format (the "; " character can only appear as the separator).
So the first list of attributes is: id=a sex=m age=16 type=1
Note that different elements in vec
may have slightly different attributes.
I'm looking for an efficient way to split vec
to a list of lists. Each element in the outer list is a list of all attribute values where the element names are the attribute keys. This means that the length of the outer list will be the length of the elements of vec
and the length of each inner list will the the length of attributes.
I currently have this implementation, which help to understand the output I need:
attributes.list <- sapply(vec, function(x) strsplit(x, split = "(\\;)(\\s+)?", perl = TRUE)[[1]])
attributes.lol <- lapply(attributes.list, function(x) {
attribute.mat <- sapply(x, function(y) strsplit(y, split = " ")[[1]])
colnames(attribute.mat) <- NULL
attribute.list <- as.list(attribute.mat[2,])
names(attribute.list) <- attribute.mat[1,]
return(attribute.list)
})
> attributes.lol[[1]]
$id
[1] "a"
$sex
[1] "m"
$age
[1] "16"
$type
[1] "1"
The length of vec
in reality is very long (~million elements) so I was wondering if there's a more efficient way to achieve this.
I would suggest a combination of "iotools" and "data.table", something along the lines of this:
library(iotools)
library(data.table)
melt(data.table(ind = seq_along(vec), trimws(mstrsplit(vec, ";"))),
"ind", na.rm = TRUE)[
, c("key", "val") := tstrsplit(value, " ", TRUE)][
, c("variable", "value") := NULL][]
Or, if you want a "wide" form (like @GGrothendieck's answer):
dcast(
melt(data.table(ind = seq_along(vec), trimws(mstrsplit(vec, ";"))),
"ind", na.rm = TRUE)[
, c("key", "val") := tstrsplit(value, " ", TRUE)][
, c("variable", "value") := NULL][], ind ~ key, value.var = "val")
I suggest the above because you mention that you want an efficient approach. Compare the following:
Sample data length 3, approximately 100000, and approximately 1 million.
vec <- c("id a; sex m; age 16; type 1;","id a; sex m; age 16;","id a; sex m; age 16; type 3")
v100k <- rep(vec, ceiling(100000/length(vec)))
v1M <- rep(vec, ceiling(1000000/length(vec)))
The approaches we want to test:
library(iotools)
library(data.table)
funAM_l <- function(invec) {
melt(data.table(ind = seq_along(invec), trimws(mstrsplit(invec, ";"))), "ind", na.rm = TRUE)[
, c("key", "val") := tstrsplit(value, " ", TRUE)][
, c("variable", "value") := NULL][]
}
funAM_w <- function(invec) dcast(funAM_l(invec), ind ~ key, value.var = "val")
funMT <- function(v) {
z <- strsplit(v, split = "(\\;)(\\s+)?", perl = TRUE)
lapply(z,function(s) {v <- unlist(strsplit(s,' ')); setNames(as.list(v[c(F,T)]),v[c(T,F)]) })
}
funF <- function(invec) rbindlist(lapply(invec, function(x) { fread(gsub(";", "\n", x)) }), idcol = TRUE)
funGG <- function(invec) read.dcf(textConnection(sub(" ",": ",trimws(unlist(strsplit(paste0(invec, ";"),";"))))))
My suggestion isn't going to win any races with a small vector:
library(microbenchmark)
microbenchmark(funAM_l(vec), funAM_w(vec), funF(vec), funGG(vec), funMT(vec))
# Unit: microseconds
# expr min lq mean median uq max neval
# funAM_l(vec) 1474.163 1525.3765 1614.28414 1573.6325 1601.3815 2828.481 100
# funAM_w(vec) 3293.376 3482.9510 3741.30381 3553.7240 3714.1730 6787.863 100
# funF(vec) 690.761 729.4900 830.61645 756.4610 777.6725 4083.904 100
# funGG(vec) 182.281 209.8405 220.46376 220.8055 232.1820 280.788 100
# funMT(vec) 57.288 76.5225 84.81496 83.2755 90.3120 166.352 100
But look at what happens when we scale up the vectors:
system.time(funAM_l(v100k))
# user system elapsed
# 0.24 0.00 0.24
system.time(funAM_w(v100k))
# user system elapsed
# 0.296 0.000 0.296
system.time(funMT(v100k))
# user system elapsed
# 1.768 0.000 1.768
system.time(funF(v100k))
# user system elapsed
# 21.960 0.136 22.068
system.time(funGG(v100k))
# user system elapsed
# 30.968 0.004 30.940
Here's how it performs on a vector of length 1 million.
system.time(funAM_w(v1M))
# user system elapsed
# 4.316 0.092 4.402
My other suggestion was going to be to look at cSplit
from my "splitstackshape" package. That is a little bit better performing than @Marat's approach.
Here it is on 1 million values:
library(splitstackshape)
system.time(dcast(
cSplit(cSplit(data.table(ind = seq_along(v1M), v1M), "v1M", ";", "long"), "v1M", " "),
ind ~ v1M_1, value.var = "v1M_2"))
# user system elapsed
# 13.744 0.156 13.882
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With