Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient versions of any/all

I often run into situations where I need to check if some condition holds for any or all elements of a very large vector or list. For example to check if a list contains any/only NULL elements I would use:

any(vapply(x, is.null, logical(1))
all(vapply(x, is.null, logical(1))

However this is inefficient because it always checks each element in the list. A smarter implementation would stop checking when the first NULL or non NULL element was found. I.e. the equivalent of:

is.null(x[[1]]) || is.null(x[[2]]) || is.null(x[[3]]) || ...
is.null(x[[1]]) && is.null(x[[2]]) && is.null(x[[3]]) && ...

Doing this with a for loop is slow. There are some special cases provided by r-base, for example anyNA is an efficient version of any(is.na(.)) that does exactly this. But I was wondering if we could implement this more generally and provide an optimized functions for checking a condition:

all_fast(x, is.null)
any_fast(x, is.null)

But also:

all_fast(x, function(z) {length(z) == 2})
all_fast(x, is, "POSIXt")
like image 350
Jeroen Ooms Avatar asked Apr 17 '14 20:04

Jeroen Ooms


3 Answers

Here's the naive way,

all0 <- function(x, FUN)
    all(vapply(x, FUN, logical(1)))

and an R loop...

all1 <- function(x, FUN) {
    for (xi in x)
        if (!FUN(xi))
            return(FALSE)
    TRUE
}

...which can be compiled

library(compiler)
all1c <- cmpfun(all1)

...or written in C

library(inline)
allc <- cfunction(signature(x="list", fun="function"), "
    SEXP call = PROTECT(lang2(fun, R_NilValue));
    int len = Rf_length(x);
    for (int i = 0; i < len; ++i) {
        SETCADR(call, VECTOR_ELT(x, i));
        if (!LOGICAL(eval(call, R_GlobalEnv))[0]) {
            UNPROTECT(1);
            return Rf_ScalarLogical(FALSE);
        }
    }
    UNPROTECT(1);
    return Rf_ScalarLogical(TRUE);")

We need to measure the performance, so

library(microbenchmark)

The worst case would seem to be that the condition passes

n <- 100000
x0 <- x <- vector("list", n)
microbenchmark(all0(x, is.null), all1(x, is.null), all1c(x, is.null),
               allc(x, is.null))
## Unit: milliseconds
##               expr      min       lq   median       uq      max neval
##   all0(x, is.null) 47.48038 50.58960 52.34946 54.10116 61.94736   100
##   all1(x, is.null) 41.52370 44.40024 45.25135 46.68218 53.22317   100
##  all1c(x, is.null) 33.76666 35.03008 35.71738 36.41944 45.37174   100
##   allc(x, is.null) 13.95340 14.43153 14.78244 15.94688 19.41072   100

so we're only 2x as fast in C compared to the compiled R version -- there is a function call on each test, so we're only saving on the looping per se. The best case is when we exit immediately and clearly showing the advantage of the loop, but then neither compilation nor C code helps us

x[[1]] <- FALSE
microbenchmark(all0(x, is.null), all1(x, is.null), all1c(x, is.null),
               allc(x, is.null))
## Unit: microseconds
##               expr       min         lq     median        uq       max neval
##   all0(x, is.null) 45376.760 45772.5020 46108.5795 46655.005 54242.687   100
##   all1(x, is.null)     1.566     1.9550     2.6335    12.015    14.177   100
##  all1c(x, is.null)     1.367     1.7340     2.0345     9.359    17.438   100
##   allc(x, is.null)     1.229     1.6925     4.6955    11.628    23.378   100

Here's an intermediate case, which does not really contain any surprises -- the C loop is about 2x faster than the compiled R loop, so gets there about 2x as quickly.

x <- x0
x[[length(x)/2]] <- FALSE
microbenchmark(all0(x, is.null), all1(x, is.null), all1c(x, is.null),
               allc(x, is.null))
## Unit: milliseconds
##               expr      min       lq    median        uq       max neval
##   all0(x, is.null) 46.85690 49.92969 51.045519 52.653137 59.445611   100
##   all1(x, is.null) 20.90066 21.92357 22.582636 23.077863 25.974395   100
##  all1c(x, is.null) 16.51897 17.44539 17.825551 18.119202 20.535709   100
##   allc(x, is.null)  6.98468  7.18392  7.312575  8.290859  9.460558   100

Explicitly testing for NULL at the C level (VECTOR_ELT(x, i) == R_NilValue) is very fast, so C code that compares the value to NULL is about 100x faster than the corresponding R code. It would seem that allNULL might be a worth-while generalization if speed were of the essence, but the case for a general-purpose C-level all doesn't seem so compelling. And of course the C code doesn't deal with NA or error conditions.

like image 195
Martin Morgan Avatar answered Oct 18 '22 11:10

Martin Morgan


Jeroen rightly says that

However this is inefficient because it always checks each element in the list. A smarter implementation would stop checking when the first NULL or non NULL element was found.

and the Rcpp sugar versions have been doing that for a few years. I have a benchmark comparison somewhere.

Edit: Found it, it is a really old example that predates our use of the rbenchmark or microbenchmark packages, and it is still in the Rcpp package in the examples/SugarPerformance directory. When I run it now, the relevant line is (and edited to fit line here)

  runs              expr hand.written       sugar        R hnd/sugar    R/sugar
1 5000    any(x * y < 0)  0.000128746 0.000232458  7.52280  0.553846 32361.9631

We used this in a lot of early talks as the "gain" seems so impressive. But even a single R run is just 0.15 milliseconds so unless you really it repeatedly it is not worth the gain.

And as Martin shows in his answer, just byte-compiling (which wasn't available yet either when we set the example up in early 2010) is helpful too.

like image 2
Dirk Eddelbuettel Avatar answered Oct 18 '22 11:10

Dirk Eddelbuettel


The 'any' version:

res <- FALSE
for ( i in seq_along(x) ) { if( is.null(x[i]) ) { res <-TRUE; break()} 
res

lapply and vapply are just for-loops internally, so you are only loosing the syntactic compression that they offer, but you are gaining the ability to break out of the loop on the first occurrence of a defining condition. You could use res <- TRUE and set to FALSE for the 'all' version.

like image 1
IRTFM Avatar answered Oct 18 '22 12:10

IRTFM