Is there a function in R that efficiently checks if a value is larger than one and smaller than another number? It should work with vectors, too.
Essentially, I'm looking for a faster version of the following function:
> in.interval <- function(x, lo, hi) (x > lo & x < hi)
> in.interval(c(2,4,6), 3, 5)
[1] FALSE TRUE FALSE
The problem here is that x
has to be touched twice, and the computation consumes twice the memory compared to a more efficient approach. Internally, I would assume it to work like this:
tmp1 <- (x > lo)
tmp2 <- (x < hi)
retval <- tmp1 & tmp2
Now, after step 2, two Boolean vectors are in memory, and the x
had to be looked at twice. My question is: Is there a (built-in?) function that does all of this in one step, without allocating the extra memory?
Following up this question: R: Select values from data table in range
EDIT: I have set up a Gist based on CauchyDistributedRV's answer at https://gist.github.com/4344844
As @James said in the comments, the trick is to substract the middle between low and high from x, and then check whether that difference is less than half of the distance between low and high. Or, in code :
in.interval2 <- function(x, lo, hi) {
abs(x-(hi+lo)/2) < (hi-lo)/2
}
That's about as fast as the .bincode
hack, and is the implementation of the algorithm you were looking for. You can translate this to C or C++ and try if you get a speedup.
Comparison with other solutions:
x <- runif(1e6,1,10)
require(rbenchmark)
benchmark(
in.interval(x, 3, 5),
in.interval2(x, 3, 5),
findInterval(x, c(3, 5)) == 1,
!is.na(.bincode(x, c(3, 5))),
order='relative',
columns=c("test", "replications", "elapsed", "relative")
)
gives
test replications elapsed relative
4 !is.na(.bincode(x, c(3, 5))) 100 1.88 1.000
2 in.interval2(x, 3, 5) 100 1.95 1.037
3 findInterval(x, c(3, 5)) == 1 100 3.42 1.819
1 in.interval(x, 3, 5) 100 3.54 1.883
findInterval
is faster than in.interval
for long x.
library(microbenchmark)
set.seed(123L)
x <- runif(1e6, 1, 10)
in.interval <- function(x, lo, hi) (x > lo & x < hi)
microbenchmark(
findInterval(x, c(3, 5)) == 1L,
in.interval(x, 3, 5),
times=100)
with
Unit: milliseconds
expr min lq median uq max
1 findInterval(x, c(3, 5)) == 1L 23.40665 25.13308 25.17272 25.25361 27.04032
2 in.interval(x, 3, 5) 42.91647 45.51040 45.60424 45.75144 46.38389
Faster still if == 1L
is not needed, and useful if the 'intervals' to be found are more than 1
> system.time(findInterval(x, 0:10))
user system elapsed
3.644 0.112 3.763
If speed is of the essence, this C implementation is fast though intolerant of, e.g., integer rather than numeric arguments
library(inline)
in.interval_c <- cfunction(c(x="numeric", lo="numeric", hi="numeric"),
' int len = Rf_length(x);
double lower = REAL(lo)[0], upper = REAL(hi)[0],
*xp = REAL(x);
SEXP out = PROTECT(NEW_LOGICAL(len));
int *outp = LOGICAL(out);
for (int i = 0; i < len; ++i)
outp[i] = (xp[i] - lower) * (xp[i] - upper) <= 0;
UNPROTECT(1);
return out;')
Timings for some solutions presented in other answers are
microbenchmark(
findInterval(x, c(3, 5)) == 1L,
in.interval.abs(x, 3, 5),
in.interval(x, 3, 5),
in.interval_c(x, 3, 5),
!is.na(.bincode(x, c(3, 5))),
times=100)
with
Unit: milliseconds
expr min lq median uq
1 findInterval(x, c(3, 5)) == 1L 23.419117 23.495943 23.556524 23.670907
2 in.interval.abs(x, 3, 5) 12.018486 12.056290 12.093279 12.161213
3 in.interval_c(x, 3, 5) 1.619649 1.641119 1.651007 1.679531
4 in.interval(x, 3, 5) 42.946318 43.050058 43.171480 43.407930
5 !is.na(.bincode(x, c(3, 5))) 15.421340 15.468946 15.520298 15.600758
max
1 26.360845
2 13.178126
3 2.785939
4 46.187129
5 18.558425
Revisiting the speed issue, in a file bin.cpp
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
SEXP bin1(SEXP x, SEXP lo, SEXP hi)
{
const int len = Rf_length(x);
const double lower = REAL(lo)[0], upper = REAL(hi)[0];
SEXP out = PROTECT(Rf_allocVector(LGLSXP, len));
double *xp = REAL(x);
int *outp = LOGICAL(out);
for (int i = 0; i < len; ++i)
outp[i] = (xp[i] - lower) * (xp[i] - upper) <= 0;
UNPROTECT(1);
return out;
}
// [[Rcpp::export]]
LogicalVector bin2(NumericVector x, NumericVector lo, NumericVector hi)
{
NumericVector xx(x);
double lower = as<double>(lo);
double upper = as<double>(hi);
LogicalVector out(x);
for( int i=0; i < out.size(); i++ )
out[i] = ( (xx[i]-lower) * (xx[i]-upper) ) <= 0;
return out;
}
// [[Rcpp::export]]
LogicalVector bin3(NumericVector x, const double lower, const double upper)
{
const int len = x.size();
LogicalVector out(len);
for (int i=0; i < len; i++)
out[i] = ( (x[i]-lower) * (x[i]-upper) ) <= 0;
return out;
}
with timings
> library(Rcpp)
> sourceCpp("bin.cpp")
> microbenchmark(bin1(x, 3, 5), bin2(x, 3, 5), bin3(x, 3, 5),
+ in.interval_c(x, 3, 5), times=1000)
Unit: milliseconds
expr min lq median uq max
1 bin1(x, 3, 5) 1.546703 2.668171 2.785255 2.839225 144.9574
2 bin2(x, 3, 5) 12.547456 13.583808 13.674477 13.792773 155.6594
3 bin3(x, 3, 5) 2.238139 3.318293 3.357271 3.540876 144.1249
4 in.interval_c(x, 3, 5) 1.545139 2.654809 2.767784 2.822722 143.7500
with about equal parts speed-up coming from use of a constant len
instead of out.size()
as the loop bound, and allocating the logical vector without initializing it (LogicalVector(len)
, since it will be initialized in the loop).
If you can deal with NA
s, you could use .bincode
:
.bincode(c(2,4,6), c(3, 5))
[1] NA 1 NA
library(microbenchmark)
set.seed(42)
x = runif(1e8, 1, 10)
microbenchmark(in.interval(x, 3, 5),
findInterval(x, c(3, 5)),
.bincode(x, c(3, 5)),
times=5)
Unit: milliseconds
expr min lq median uq max
1 .bincode(x, c(3, 5)) 930.4842 934.3594 955.9276 1002.857 1047.348
2 findInterval(x, c(3, 5)) 1438.4620 1445.7131 1472.4287 1481.380 1551.419
3 in.interval(x, 3, 5) 2977.8460 3046.7720 3075.8381 3182.013 3288.020
The main speedup I can find is through byte-compiling the function. Even an Rcpp solution (albeit using Rcpp sugar, and not a more drilled-down C solution) is slower than the compiled solution.
library( compiler )
library( microbenchmark )
library( inline )
in.interval <- function(x, lo, hi) (x > lo & x < hi)
in.interval2 <- cmpfun( in.interval )
in.interval3 <- function(x, lo, hi) {
sapply( x, function(xx) {
xx > lo && xx < hi }
)
}
in.interval4 <- cmpfun( in.interval3 )
in.interval5 <- rcpp( signature(x="numeric", lo="numeric", hi="numeric"), '
NumericVector xx(x);
double lower = Rcpp::as<double>(lo);
double upper = Rcpp::as<double>(hi);
return Rcpp::wrap( xx > lower & xx < upper );
')
x <- c(2, 4, 6)
lo <- 3
hi <- 5
microbenchmark(
in.interval(x, lo, hi),
in.interval2(x, lo, hi),
in.interval3(x, lo, hi),
in.interval4(x, lo, hi),
in.interval5(x, lo, hi)
)
gives me
Unit: microseconds
expr min lq median uq max
1 in.interval(x, lo, hi) 1.575 2.0785 2.5025 2.6560 7.490
2 in.interval2(x, lo, hi) 1.035 1.4230 1.6800 2.0705 11.246
3 in.interval3(x, lo, hi) 25.439 26.2320 26.7350 27.2250 77.541
4 in.interval4(x, lo, hi) 22.479 23.3920 23.8395 24.3725 33.770
5 in.interval5(x, lo, hi) 1.425 1.8740 2.2980 2.5565 21.598
library( compiler )
library( inline )
library( microbenchmark )
in.interval.oldRcpp <- rcpp(
signature(x="numeric", lo="numeric", hi="numeric"), '
NumericVector xx(x);
double lower = Rcpp::as<double>(lo);
double upper = Rcpp::as<double>(hi);
return Rcpp::wrap( (xx > lower) & (xx < upper) );
')
in.interval.abs <- rcpp(
signature(x="numeric", lo="numeric", hi="numeric"), '
NumericVector xx(x);
double lower = as<double>(lo);
double upper = as<double>(hi);
LogicalVector out(x);
for( int i=0; i < out.size(); i++ ) {
out[i] = ( (xx[i]-lower) * (xx[i]-upper) ) <= 0;
}
return wrap(out);
')
in.interval.abs.sugar <- rcpp(
signature( x="numeric", lo="numeric", hi="numeric"), '
NumericVector xx(x);
double lower = as<double>(lo);
double upper = as<double>(hi);
return wrap( ((xx-lower) * (xx-upper)) <= 0 );
')
x <- runif(1E5)
lo <- 0.5
hi <- 1
microbenchmark(
in.interval.oldRcpp(x, lo, hi),
in.interval.abs(x, lo, hi),
in.interval.abs.sugar(x, lo, hi)
)
all.equal( in.interval.oldRcpp(x, lo, hi), in.interval.abs(x, lo, hi) )
all.equal( in.interval.oldRcpp(x, lo, hi), in.interval.abs.sugar(x, lo, hi) )
gives me
1 in.interval.abs(x, lo, hi) 662.732 666.4855 669.939 690.6585 1580.707
2 in.interval.abs.sugar(x, lo, hi) 722.789 726.0920 728.795 742.6085 1671.093
3 in.interval.oldRcpp(x, lo, hi) 1870.784 1876.4890 1892.854 1935.0445 2859.025
> all.equal( in.interval.oldRcpp(x, lo, hi), in.interval.abs(x, lo, hi) )
[1] TRUE
> all.equal( in.interval.oldRcpp(x, lo, hi), in.interval.abs.sugar(x, lo, hi) )
[1] TRUE
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With