Suppose I have a vector vec <- c("D","B","B","C","C"). My objective is to end up with a list of length length(unique(vec)), where element i of the list is a vector of indices giving the positions of unique(vec)[i] in vec.
For example, for this vec the list would be:
exampleList <- list()
exampleList[[1]] <- c(1) #Since "D" is the first element
exampleList[[2]] <- c(2,3) #Since "B" is the 2nd/3rd element.
exampleList[[3]] <- c(4,5) #Since "C" is the 4th/5th element.
I tried the following approach, but it is too slow. My real vector is large, so I need faster code:
vec <- c("D","B","B","C","C")
uniques <- unique(vec)
# one full scan of vec per unique value
exampleList <- lapply(seq_along(uniques), function(i) {
  which(vec == uniques[i])
})
exampleList
Update: The behaviour of DT[, list(list(.)), by=.] sometimes gave wrong results in R version >= 3.1.0. This is now fixed in commit #1280 in the current development version of data.table, v1.9.3. From NEWS:
DT[, list(list(.)), by=.] returns correct results in R >=3.1.0 as well. The bug was due to recent (welcoming) changes in R v3.1.0 where list(.) does not result in a copy. Closes #481.
Using data.table is about 15x faster than tapply:
library(data.table)
vec <- c("D","B","B","C","C")
dt = as.data.table(vec)[, list(list(.I)), by = vec]
dt
# vec V1
#1: D 1
#2: B 2,3
#3: C 4,5
# to get it in the desired format
# (perhaps in the future data.table's setnames will work for lists instead)
setattr(dt$V1, 'names', dt$vec)
dt$V1
#$D
#[1] 1
#
#$B
#[1] 2 3
#
#$C
#[1] 4 5
Speed tests:
vec = sample(letters, 1e7, TRUE)
system.time(tapply(seq_along(vec), vec, identity)[unique(vec)])
# user system elapsed
# 7.92 0.35 8.50
system.time({dt = as.data.table(vec)[, list(list(.I)), by = vec]; setattr(dt$V1, 'names', dt$vec); dt$V1})
# user system elapsed
# 0.39 0.09 0.49
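For comparison, here is a minimal base R sketch of the same grouping using split(). This is not part of the approach benchmarked above and I have not timed it, so treat it as an alternative to try, not as a measured result. The name idx is just illustrative. The factor levels are set to unique(vec) so that the list comes back in order of first appearance instead of alphabetical order.

```r
vec <- c("D", "B", "B", "C", "C")

# Levels in order of first appearance, so the result follows unique(vec)
f <- factor(vec, levels = unique(vec))

# Group the positions 1..length(vec) by the value found at each position
idx <- split(seq_along(vec), f)
idx
```

Each element of idx is named after its value, so idx[["B"]] gives the positions of "B" directly.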