I often run into situations where I need to check whether some condition holds for any or all elements of a very large vector or list. For example, to check whether a list contains any/only NULL elements I would use:
any(vapply(x, is.null, logical(1)))
all(vapply(x, is.null, logical(1)))
However, this is inefficient because it always checks every element in the list. A smarter implementation would stop checking when the first NULL or non-NULL element is found, i.e. the equivalent of:
is.null(x[[1]]) || is.null(x[[2]]) || is.null(x[[3]]) || ...
is.null(x[[1]]) && is.null(x[[2]]) && is.null(x[[3]]) && ...
Doing this with a for loop is slow. There are some special cases provided by base R: for example, anyNA is an efficient version of any(is.na(.)) that does exactly this. But I was wondering whether we could implement this more generally and provide optimized functions for checking a condition:
all_fast(x, is.null)
any_fast(x, is.null)
But also:
all_fast(x, function(z) {length(z) == 2})
all_fast(x, is, "POSIXt")
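A plain R sketch of the proposed interface might look like the following (all_fast and any_fast are the hypothetical names from above, and extra arguments such as "POSIXt" are forwarded to FUN via ...), although, as noted, an R-level loop like this is slow:
all_fast <- function(x, FUN, ...) {
    for (xi in x)
        if (!FUN(xi, ...))
            return(FALSE)
    TRUE
}

any_fast <- function(x, FUN, ...) {
    for (xi in x)
        if (FUN(xi, ...))
            return(TRUE)
    FALSE
}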
Here's the naive way,
all0 <- function(x, FUN)
    all(vapply(x, FUN, logical(1)))
and an R loop...
all1 <- function(x, FUN) {
    for (xi in x)
        if (!FUN(xi))
            return(FALSE)
    TRUE
}
...which can be compiled
library(compiler)
all1c <- cmpfun(all1)
...or written in C
library(inline)
allc <- cfunction(signature(x="list", fun="function"), "
    /* build a one-argument call fun(.) once, then reuse it */
    SEXP call = PROTECT(lang2(fun, R_NilValue));
    int len = Rf_length(x);
    for (int i = 0; i < len; ++i) {
        /* plug the i-th element into the call and evaluate it */
        SETCADR(call, VECTOR_ELT(x, i));
        if (!LOGICAL(eval(call, R_GlobalEnv))[0]) {
            UNPROTECT(1);
            return Rf_ScalarLogical(FALSE);  /* short-circuit */
        }
    }
    UNPROTECT(1);
    return Rf_ScalarLogical(TRUE);")
We need to measure the performance, so
library(microbenchmark)
The worst case would seem to be when the condition holds for every element, so the whole list must be scanned
n <- 100000
x0 <- x <- vector("list", n)
microbenchmark(all0(x, is.null), all1(x, is.null), all1c(x, is.null),
               allc(x, is.null))
## Unit: milliseconds
##               expr      min       lq   median       uq      max neval
##   all0(x, is.null) 47.48038 50.58960 52.34946 54.10116 61.94736   100
##   all1(x, is.null) 41.52370 44.40024 45.25135 46.68218 53.22317   100
##  all1c(x, is.null) 33.76666 35.03008 35.71738 36.41944 45.37174   100
##   allc(x, is.null) 13.95340 14.43153 14.78244 15.94688 19.41072   100
So we're only about 2x as fast in C as in the compiled R version -- there is an R function call on each test, so we're only saving on the looping per se. The best case is when we exit immediately, clearly showing the advantage of the loop, but there neither compilation nor C code helps us
x[[1]] <- FALSE
microbenchmark(all0(x, is.null), all1(x, is.null), all1c(x, is.null),
               allc(x, is.null))
## Unit: microseconds
##               expr       min         lq     median        uq       max neval
##   all0(x, is.null) 45376.760 45772.5020 46108.5795 46655.005 54242.687   100
##   all1(x, is.null)     1.566     1.9550     2.6335    12.015    14.177   100
##  all1c(x, is.null)     1.367     1.7340     2.0345     9.359    17.438   100
##   allc(x, is.null)     1.229     1.6925     4.6955    11.628    23.378   100
Here's an intermediate case, which does not really contain any surprises -- the C loop is about 2x faster than the compiled R loop, so it gets there about 2x as quickly.
x <- x0
x[[length(x)/2]] <- FALSE
microbenchmark(all0(x, is.null), all1(x, is.null), all1c(x, is.null),
               allc(x, is.null))
## Unit: milliseconds
##               expr      min       lq    median        uq       max neval
##   all0(x, is.null) 46.85690 49.92969 51.045519 52.653137 59.445611   100
##   all1(x, is.null) 20.90066 21.92357 22.582636 23.077863 25.974395   100
##  all1c(x, is.null) 16.51897 17.44539 17.825551 18.119202 20.535709   100
##   allc(x, is.null)  6.98468  7.18392  7.312575  8.290859  9.460558   100
Explicitly testing for NULL at the C level (VECTOR_ELT(x, i) == R_NilValue) is very fast, so C code that compares each value to NULL is about 100x faster than the corresponding R code. It would seem that an allNULL function might be a worthwhile generalization if speed were of the essence, but the case for a general-purpose C-level all doesn't seem so compelling. And of course the C code doesn't deal with NA or error conditions.
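For illustration, here is a minimal sketch of such a specialized allNULL in the style of allc above (allNULLc is a made-up name):
library(inline)
allNULLc <- cfunction(signature(x="list"), "
    /* compare each element to R_NilValue directly: no R-level call */
    int len = Rf_length(x);
    for (int i = 0; i < len; ++i)
        if (VECTOR_ELT(x, i) != R_NilValue)
            return Rf_ScalarLogical(FALSE);  /* stop at first non-NULL */
    return Rf_ScalarLogical(TRUE);")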
Jeroen rightly says that
However this is inefficient because it always checks each element in the list. A smarter implementation would stop checking when the first NULL or non NULL element was found.
and the Rcpp sugar versions have been doing that for a few years. I have a benchmark comparison somewhere.
Edit: Found it. It is a really old example that predates our use of the rbenchmark or microbenchmark packages, and it is still in the Rcpp package in the examples/SugarPerformance directory. When I run it now, the relevant line is (edited to fit the line width here):
  runs           expr hand.written       sugar       R hnd/sugar    R/sugar
1 5000 any(x * y < 0)  0.000128746 0.000232458 7.52280  0.553846 32361.9631
We used this in a lot of early talks as the "gain" seemed so impressive. But even a single R run is only about 1.5 milliseconds (7.5 seconds for 5000 runs), so unless you really need it repeatedly it is not worth the gain.
And as Martin shows in his answer, just byte-compiling (which wasn't yet available when we set the example up in early 2010) is helpful too.
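For completeness, a minimal sketch of how the sugar expression from that benchmark can be wrapped (assuming Rcpp's cppFunction; anyNegProduct is a made-up name):
library(Rcpp)
cppFunction('
    bool anyNegProduct(NumericVector x, NumericVector y) {
        // sugar keeps x * y < 0 as an unevaluated expression, so any()
        // can return as soon as one element satisfies the condition
        return is_true(any(x * y < 0));
    }')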
The 'any' version:
res <- FALSE
for (i in seq_along(x)) {
    # note x[[i]]: x[i] returns a one-element list, which is never NULL
    if (is.null(x[[i]])) {
        res <- TRUE
        break
    }
}
res
lapply and vapply are just for loops internally, so you are only losing the syntactic compression that they offer, but you are gaining the ability to break out of the loop on the first occurrence of the defining condition. You could use res <- TRUE and set it to FALSE for the 'all' version, as sketched below.
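A minimal sketch of that 'all' counterpart, following the same pattern:
res <- TRUE
for (i in seq_along(x)) {
    if (!is.null(x[[i]])) {
        res <- FALSE
        break
    }
}
res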