Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get indices of repeated instances of elements of a vector in other vector (both very large)

Tags:

indexing

r

I have two vectors, one (A) of about 100 million non-unique elements (integers), the other (B) of 1 million of the same, unique, elements. I am trying to get a list containing the indices of the repeated instances of each element of B in A.

A <- c(2, 1, 1, 1, 2, 1, 1, 3, 3, 2)
B <- 1:3

# would result in this:
[[1]]
[1] 2 3 4 6 7

[[2]]
[1]  1  5 10

[[3]]
[1] 8 9

I first, naively, tried this:

b_indices <- lapply(B, function(b) which(A == b))

which is horribly inefficient, and apparently wouldn't complete in a few years.

The second thing I tried was to create a list of empty vectors, indexed with all elements of B, and to then loop through A, appending the index to the corresponding vector for each element in A. Although technically O(n), I'm not sure about the time to repeatedly append elements. This approach would apparently take ~ 2-3 days, which is still too slow...

Is there anything that could work faster?

like image 242
Gerome Pistre Avatar asked Dec 04 '22 00:12

Gerome Pistre


1 Answers

This is fast:

A1 <- order(A, method = "radix")

split(A1, A[A1])
#$`1`
#[1] 2 3 4 6 7
#
#$`2`
#[1]  1  5 10
#
#$`3`
#[1] 8 9

B <- seq_len(1e6)
set.seed(42)
A <- sample(B, 1e8, TRUE)

system.time({
  A1 <- order(A, method = "radix")

  res <- split(A1, A[A1])
})
# user      system     elapsed 
#8.650       1.056       9.704
like image 99
Roland Avatar answered Dec 06 '22 14:12

Roland