 

Efficient way to get location of match between vectors

Tags:

r

I need an efficient way to find the indexes (not the logical vector) of matches between two vectors. I can do this with:

which(c("a", "q", "f", "c", "z") %in% letters[1:10])

In the same way, the position of the maximum is found more efficiently with which.max than with which(... %in% max(...)):

which(c(1:8, 10, 9) %in% max(c(1:8, 10, 9)))
which.max(c(1:8, 10, 9))
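Similarly, for a single lookup value, base R's match() returns the first matching position directly; a small sketch:

```r
# match() gives the position of the first match (NA if there is none)
match(10, c(1:8, 10, 9))   # 9
which.max(c(1:8, 10, 9))   # 9
```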

I am wondering if I have the most efficient way of finding the position of matching terms in the 2 vectors.

EDIT: Per the questions/comments below: I am operating on a list of character vectors. The problem involves sentences that have been broken into bags of words, as seen below. The list may contain 10000-20000 or more character vectors. Based on each matching index I will then grab the 4 words before and the 4 words after it and calculate a score.

x <- list(c('I', 'like', 'chocolate', 'cake'), c('chocolate', 'cake', 'is', 'good'))
y <- rep(x, 5000)

lapply(y, function(x) {
    which(x %in% c("chocolate", "good"))
})
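The windowing step described in the edit might then be sketched as follows; the window extraction here is an assumption on my part, and the scoring itself is not shown:

```r
# For each matched position, take up to 4 words on either side
windows <- lapply(y, function(words) {
    idx <- which(words %in% c("chocolate", "good"))
    lapply(idx, function(i) words[max(1, i - 4):min(length(words), i + 4)])
})
```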
asked Aug 11 '13 by Tyler Rinker


2 Answers

Here's a relatively faster way using data.table:

require(data.table)
vv <- vapply(y, length, 0L)
DT <- data.table(y = unlist(y), id = rep(seq_along(y), vv), pos = sequence(vv))
setkey(DT, y)
# OLD CODE which will not take care of no-match entries (commented)
# DT[J(c("chocolate", "good")), list(list(pos)), by=id]$V1

setkey(DT[J(c("chocolate", "good"))], id)[J(seq_along(vv)), list(list(pos))]$V1

The idea:

First we unlist your list into a column of DT named y, along with two other columns: id, the index of the element within the original list, and pos, the position of the word within that element. Keying the table on y then allows fast subsetting: DT[J(c("chocolate", "good"))] pulls out every matching row together with its id and pos. Finally, to take care of entries with no match at all, we re-key that result on id and subset on all possible ids, J(seq_along(vv)), which produces NA for the missing ones, and then collect the pos values for each id into the list column V1.
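On the two-sentence example from the question, the intermediate table looks like this:

```r
require(data.table)
x <- list(c('I', 'like', 'chocolate', 'cake'), c('chocolate', 'cake', 'is', 'good'))
vv <- vapply(x, length, 0L)   # lengths: 4 4
data.table(y = unlist(x), id = rep(seq_along(x), vv), pos = sequence(vv))
#            y id pos
# 1:         I  1   1
# 2:      like  1   2
# 3: chocolate  1   3
# 4:      cake  1   4
# 5: chocolate  2   1
# 6:      cake  2   2
# 7:        is  2   3
# 8:      good  2   4
```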


Benchmarking with the lapply code on your post:

x <- list(c('I', 'like', 'chocolate', 'cake'), c('chocolate', 'cake', 'is', 'good'))
y <- rep(x, 5000)

require(data.table)
arun <- function() {
    vv <- vapply(y, length, 0L)
    DT <- data.table(y = unlist(y), id = rep(seq_along(y), vv), pos = sequence(vv))
    setkey(DT, y)
    setkey(DT[J(c("chocolate", "good"))], id)[J(seq_along(vv)), list(list(pos))]$V1
}

tyler <- function() {
    lapply(y, function(x) {
        which(x %in% c("chocolate", "good"))
    })
}

require(microbenchmark)
microbenchmark(a1 <- arun(), a2 <- tyler(), times=50)

Unit: milliseconds
          expr       min        lq    median        uq       max neval
  a1 <- arun()  30.71514  31.92836  33.19569  39.31539  88.56282    50
 a2 <- tyler() 626.67841 669.71151 726.78236 785.86444 955.55803    50

identical(a1, a2)
# [1] TRUE
answered Nov 13 '22 by Arun


The C++ answer was faster when comparing single characters, but I think using a vector of strings introduces enough overhead that it is now slower:

char1 <- c("a", "q", "f", "c", "z")
char2 <- letters[1:10]

library(inline)
cpp_whichin_src <- '
// For each element of a, record its 1-based position if it occurs in b, else 0
Rcpp::CharacterVector xa(a);
Rcpp::CharacterVector xb(b);
int n_xa = xa.size();
int n_xb = xb.size();

NumericVector res(n_xa);

// Convert to std::string for element-wise comparison
std::vector<std::string> sa = Rcpp::as< std::vector<std::string> >(xa);
std::vector<std::string> sb = Rcpp::as< std::vector<std::string> >(xb);

for(int i=0; i < n_xa; i++) {
  for(int j=0; j < n_xb; j++) {
    if( sa[i] == sb[j] ) res[i] = i+1;  // store the 1-based index of the match
  }
}
return res;
'
cpp_whichin <- cxxfunction(signature(a="character",b="character"), cpp_whichin_src, plugin="Rcpp")

which.in_cpp <- function(char1, char2) {
  idx <- cpp_whichin(char1,char2)
  idx[idx!=0]
}

which.in_naive <- function(char1, char2) {
  which(char1 %in% char2)
}

which.in_CW <- function(char1, char2) {
  unlist(sapply(char2,function(x) which(x==char1)))
}

which.in_cpp(char1,char2)
which.in_naive(char1,char2)
which.in_CW(char1,char2)

**Benchmarks**

library(microbenchmark)
microbenchmark(
  which.in_cpp(char1,char2),
  which.in_naive(char1,char2),
  which.in_CW(char1,char2)
)

set.seed(1)
cmb <- apply(combn(letters,2), 2, paste,collapse="")
char1 <- sample( cmb, 100 )
char2 <- sample( cmb, 100 )

Unit: microseconds
                          expr     min      lq   median       uq      max
1   which.in_cpp(char1, char2) 114.890 120.023 126.6930 135.5630  537.011
2    which.in_CW(char1, char2) 697.505 725.826 766.4385 813.8615 8032.168
3 which.in_naive(char1, char2)  17.391  20.289  22.4545  25.4230   76.826

# Same as above, but with 3 letter combos and 1000 sampled

Unit: microseconds
                          expr       min        lq     median        uq       max
1   which.in_cpp(char1, char2)  8505.830  8715.598  8863.3130  8997.478  9796.288
2    which.in_CW(char1, char2) 23430.493 27987.393 28871.2340 30032.450 31926.546
3 which.in_naive(char1, char2)   129.904   135.736   158.1905   180.260  3821.785
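As an aside (not benchmarked above), if the same lookup vector is queried repeatedly, the fastmatch package offers fmatch(), a drop-in replacement for match() that caches the hash table on the lookup vector; a sketch assuming fastmatch is installed:

```r
library(fastmatch)
char1 <- c("a", "q", "f", "c", "z")
char2 <- letters[1:10]
# equivalent to which(char1 %in% char2); repeated calls reuse the cached hash
which(fmatch(char1, char2, nomatch = 0L) > 0L)   # 1 3 4
```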
answered Nov 13 '22 by Ari B. Friedman