Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most efficient way to determine if element exists in a vector

I have several algorithms that depend on the efficiency of determining whether an element exists in a vector or not. It seems to me that %in% (which is equivalent to is.element()) should be the most efficient as it simply returns a Boolean value. After testing several methods, to my surprise, those methods are by far the most inefficient. Below is my analysis (the results get worse as the size of the vectors increase):

EfficiencyTest <- function(n, Lim) {

    samp1 <- sample(Lim, n)
    set1 <- sample(Lim, Lim)

    print(system.time(for(i in 1:n) {which(set1==samp1[i])}))
    print(system.time(for(i in 1:n) {samp1[i] %in% set1}))
    print(system.time(for(i in 1:n) {is.element(samp1[i], set1)}))
    print(system.time(for(i in 1:n) {match(samp1[i], set1)}))
    a <- system.time(set1 <- sort(set1))
    b <- system.time(for (i in 1:n) {BinVecCheck(samp1[i], set1)})
    print(a+b)
}

> EfficiencyTest(10^3, 10^5)
user  system elapsed 
0.29    0.11    0.40 
user  system elapsed 
19.79    0.39   20.21 
user  system elapsed 
19.89    0.53   20.44 
user  system elapsed 
20.04    0.28   20.33 
user  system elapsed 
0.02    0.00    0.03 

Where BinVecCheck is a binary search algorithm that I wrote that returns TRUE/FALSE. Note that I include the time it takes to sort the vector with the final method. Here is the code for the binary search:

BinVecCheck <- function(tar, vec) {      
    if (tar==vec[1] || tar==vec[length(vec)]) {return(TRUE)}        
    size <- length(vec)
    size2 <- trunc(size/2)
    dist <- (tar - vec[size2])       
    if (dist > 0) {
        lower <- size2 - 1L
        upper <- size
    } else {
        lower <- 1L
        upper <- size2 + 1L
    }        
    while (size2 > 1 && !(dist==0)) {
        size2 <- trunc((upper-lower)/2)
        temp <- lower+size2
        dist <- (tar - vec[temp])
        if (dist > 0) {
            lower <- temp-1L
        } else {
            upper <- temp+1L
        }
    }       
    if (dist==0) {return(TRUE)} else {return(FALSE)}
}

Platform Info:

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Question

Is there a more efficient way of determining whether an element exists in a vector in R? For example, is there an equivalent R function to the Python set function, that greatly improves on this approach? Also, why is %in%, and the like, so inefficient even when compared to the which function which gives more information (not only does it determine existence, but it also gives the indices of all true accounts)?

like image 787
Joseph Wood Avatar asked Oct 31 '15 15:10

Joseph Wood


2 Answers

My tests aren't bearing out all of your claims, but that seems (?) to be due to cross-platform differences (which makes the question even more mysterious, and possibly worth taking up on [email protected], although maybe not since the fastmatch solution below dominates anyway ...)

 n <- 10^3; Lim <- 10^5
 set.seed(101)
 samp1 <- sample(Lim,n)
 set1 <- sample(Lim,Lim)
 library("rbenchmark")

 library("fastmatch")
 `%fin%` <- function(x, table) {
     stopifnot(require(fastmatch))
     fmatch(x, table, nomatch = 0L) > 0L
 }
 benchmark(which=sapply(samp1,function(x) which(set1==x)),
           infun=sapply(samp1,function(x) x %in% set1),
           fin= sapply(samp1,function(x) x %fin% set1),
           brc= sapply(samp1,BinVecCheck,vec=sort(set1)),
           replications=20,
    columns = c("test", "replications", "elapsed", "relative"))

##    test replications elapsed relative
## 4   brc           20   0.871    2.329
## 3   fin           20   0.374    1.000
## 2 infun           20   6.480   17.326
## 1 which           20  10.634   28.433

This says that %in% is about twice as fast as which -- your BinVecCheck function is 7x better, but the fastmatch solution from here gets another factor of 2. I don't know if a specialized Rcpp implementation could do better or not ... In fact, I get different answers even when running your code:

##    user  system elapsed   (which)
##   0.488   0.096   0.586 
##    user  system elapsed   (%in%) 
##   0.184   0.132   0.315 
##    user  system elapsed  (is.element)
##   0.188   0.124   0.313 
##    user  system elapsed  (match)
##   0.148   0.164   0.312 
##    user  system elapsed  (BinVecCheck)
##   0.048   0.008   0.055 

update: on r-devel Peter Dalgaard explains the platform discrepancy (which is an R version difference, not an OS difference) by pointing to the R NEWS entry:

match(x, table) is faster, sometimes by an order of magnitude, when x is of length one and incomparables is unchanged, thanks to Haverty's PR#16491.

sessionInfo()
## R Under development (unstable) (2015-10-23 r69563)
## Platform: i686-pc-linux-gnu (32-bit)
## Running under: Ubuntu precise (12.04.5 LTS)
like image 167
Ben Bolker Avatar answered Oct 13 '22 00:10

Ben Bolker


%in% is just sugar for match, and is defined as:

"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0

Both match and which are low level (compiled C) functions called by .Internal(). You can actually see the source code by using the pryr package:

install.packages("pryr")
library(pryr)
pryr::show_c_source(.Internal(which(x)))
pryr::show_c_source(.Internal(match(x, table, nomatch, incomparables)))

You would be pointed to this page for which and this page for match. which does not perform any of the casting, checks etc that match performs. This might explain its higher performance in your tests (but I haven't tested your results myself).

like image 24
kliron Avatar answered Oct 12 '22 23:10

kliron