Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast vectorized function to check if a value is in an interval

Is there a function in R that efficiently checks if a value is larger than one and smaller than another number? It should work with vectors, too.

Essentially, I'm looking for a faster version of the following function:

> in.interval <- function(x, lo, hi) (x > lo & x < hi)
> in.interval(c(2,4,6), 3, 5)
[1] FALSE  TRUE FALSE

The problem here is that x has to be touched twice, and the computation consumes twice the memory compared to a more efficient approach. Internally, I would assume it to work like this:

  1. Compute tmp1 <- (x > lo)
  2. Compute tmp2 <- (x < hi)
  3. Compute retval <- tmp1 & tmp2

Now, after step 2, two Boolean vectors are in memory, and the x had to be looked at twice. My question is: Is there a (built-in?) function that does all of this in one step, without allocating the extra memory?

Following up this question: R: Select values from data table in range

EDIT: I have set up a Gist based on CauchyDistributedRV's answer at https://gist.github.com/4344844

like image 318
krlmlr Avatar asked Dec 20 '12 10:12

krlmlr


4 Answers

As @James said in the comments, the trick is to substract the middle between low and high from x, and then check whether that difference is less than half of the distance between low and high. Or, in code :

in.interval2 <- function(x, lo, hi) {
    abs(x-(hi+lo)/2) < (hi-lo)/2 
}

That's about as fast as the .bincode hack, and is the implementation of the algorithm you were looking for. You can translate this to C or C++ and try if you get a speedup.

Comparison with other solutions:

x <- runif(1e6,1,10)
require(rbenchmark)
benchmark(
  in.interval(x, 3, 5),
  in.interval2(x, 3, 5),
  findInterval(x, c(3, 5)) == 1,
  !is.na(.bincode(x, c(3, 5))),
  order='relative',
  columns=c("test", "replications", "elapsed", "relative")
) 

gives

                           test replications elapsed relative
4  !is.na(.bincode(x, c(3, 5)))          100    1.88    1.000
2         in.interval2(x, 3, 5)          100    1.95    1.037
3 findInterval(x, c(3, 5)) == 1          100    3.42    1.819
1          in.interval(x, 3, 5)          100    3.54    1.883
like image 167
Joris Meys Avatar answered Nov 07 '22 02:11

Joris Meys


findInterval is faster than in.interval for long x.

library(microbenchmark)

set.seed(123L)
x <- runif(1e6, 1, 10)
in.interval <- function(x, lo, hi) (x > lo & x < hi)

microbenchmark(
    findInterval(x, c(3, 5)) == 1L,
    in.interval(x, 3, 5),
    times=100)

with

Unit: milliseconds
                            expr      min       lq   median       uq      max
1 findInterval(x, c(3, 5)) == 1L 23.40665 25.13308 25.17272 25.25361 27.04032
2           in.interval(x, 3, 5) 42.91647 45.51040 45.60424 45.75144 46.38389

Faster still if == 1L is not needed, and useful if the 'intervals' to be found are more than 1

> system.time(findInterval(x, 0:10))
   user  system elapsed 
  3.644   0.112   3.763 

If speed is of the essence, this C implementation is fast though intolerant of, e.g., integer rather than numeric arguments

library(inline)
in.interval_c <- cfunction(c(x="numeric", lo="numeric", hi="numeric"),
'    int len = Rf_length(x);
     double lower = REAL(lo)[0], upper = REAL(hi)[0],
            *xp = REAL(x);
     SEXP out = PROTECT(NEW_LOGICAL(len));
     int *outp = LOGICAL(out);

     for (int i = 0; i < len; ++i)
         outp[i] = (xp[i] - lower) * (xp[i] - upper) <= 0;

     UNPROTECT(1);
     return out;')

Timings for some solutions presented in other answers are

microbenchmark(
    findInterval(x, c(3, 5)) == 1L,
    in.interval.abs(x, 3, 5),
    in.interval(x, 3, 5),
    in.interval_c(x, 3, 5),
    !is.na(.bincode(x, c(3, 5))),
    times=100)

with

Unit: milliseconds
                            expr       min        lq    median        uq
1 findInterval(x, c(3, 5)) == 1L 23.419117 23.495943 23.556524 23.670907
2       in.interval.abs(x, 3, 5) 12.018486 12.056290 12.093279 12.161213
3         in.interval_c(x, 3, 5)  1.619649  1.641119  1.651007  1.679531
4           in.interval(x, 3, 5) 42.946318 43.050058 43.171480 43.407930
5   !is.na(.bincode(x, c(3, 5))) 15.421340 15.468946 15.520298 15.600758
        max
1 26.360845
2 13.178126
3  2.785939
4 46.187129
5 18.558425

Revisiting the speed issue, in a file bin.cpp

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
SEXP bin1(SEXP x, SEXP lo, SEXP hi)
{
    const int len = Rf_length(x);
    const double lower = REAL(lo)[0], upper = REAL(hi)[0];
    SEXP out = PROTECT(Rf_allocVector(LGLSXP, len));

    double *xp = REAL(x);
    int *outp = LOGICAL(out);
    for (int i = 0; i < len; ++i)
    outp[i] = (xp[i] - lower) * (xp[i] - upper) <= 0;

    UNPROTECT(1);
    return out;
}

// [[Rcpp::export]]
LogicalVector bin2(NumericVector x, NumericVector lo, NumericVector hi)
{
    NumericVector xx(x);
    double lower = as<double>(lo);
    double upper = as<double>(hi); 

    LogicalVector out(x);
    for( int i=0; i < out.size(); i++ )
        out[i] = ( (xx[i]-lower) * (xx[i]-upper) ) <= 0;

    return out;
}

// [[Rcpp::export]]
LogicalVector bin3(NumericVector x, const double lower, const double upper)
{
    const int len = x.size();
    LogicalVector out(len);

    for (int i=0; i < len; i++)
        out[i] = ( (x[i]-lower) * (x[i]-upper) ) <= 0;

    return out;
}

with timings

> library(Rcpp)
> sourceCpp("bin.cpp")
> microbenchmark(bin1(x, 3, 5), bin2(x, 3, 5), bin3(x, 3, 5),                   
+                in.interval_c(x, 3, 5), times=1000)                            
Unit: milliseconds                                                              
                    expr       min        lq    median        uq      max       
1          bin1(x, 3, 5)  1.546703  2.668171  2.785255  2.839225 144.9574       
2          bin2(x, 3, 5) 12.547456 13.583808 13.674477 13.792773 155.6594       
3          bin3(x, 3, 5)  2.238139  3.318293  3.357271  3.540876 144.1249       
4 in.interval_c(x, 3, 5)  1.545139  2.654809  2.767784  2.822722 143.7500       

with about equal parts speed-up coming from use of a constant len instead of out.size() as the loop bound, and allocating the logical vector without initializing it (LogicalVector(len), since it will be initialized in the loop).

like image 38
Martin Morgan Avatar answered Nov 07 '22 04:11

Martin Morgan


If you can deal with NAs, you could use .bincode:

.bincode(c(2,4,6), c(3, 5))
[1] NA  1 NA

library(microbenchmark)
set.seed(42)
x = runif(1e8, 1, 10)
microbenchmark(in.interval(x, 3, 5),
               findInterval(x,  c(3, 5)),
               .bincode(x, c(3, 5)),
               times=5)

Unit: milliseconds
                      expr       min        lq    median       uq      max
1     .bincode(x, c(3, 5))  930.4842  934.3594  955.9276 1002.857 1047.348
2 findInterval(x, c(3, 5)) 1438.4620 1445.7131 1472.4287 1481.380 1551.419
3     in.interval(x, 3, 5) 2977.8460 3046.7720 3075.8381 3182.013 3288.020
like image 39
Roland Avatar answered Nov 07 '22 03:11

Roland


The main speedup I can find is through byte-compiling the function. Even an Rcpp solution (albeit using Rcpp sugar, and not a more drilled-down C solution) is slower than the compiled solution.

library( compiler )
library( microbenchmark )
library( inline )

in.interval <- function(x, lo, hi) (x > lo & x < hi)
in.interval2 <- cmpfun( in.interval )
in.interval3 <- function(x, lo, hi) {
  sapply( x, function(xx) { 
    xx > lo && xx < hi }
          )
}
in.interval4 <- cmpfun( in.interval3 )
in.interval5 <- rcpp( signature(x="numeric", lo="numeric", hi="numeric"), '
NumericVector xx(x);
double lower = Rcpp::as<double>(lo);
double upper = Rcpp::as<double>(hi);

return Rcpp::wrap( xx > lower & xx < upper );
')

x <- c(2, 4, 6)
lo <- 3
hi <- 5

microbenchmark(
  in.interval(x, lo, hi),
  in.interval2(x, lo, hi),
  in.interval3(x, lo, hi),
  in.interval4(x, lo, hi),
  in.interval5(x, lo, hi)
)

gives me

Unit: microseconds
                     expr    min      lq  median      uq    max
1  in.interval(x, lo, hi)  1.575  2.0785  2.5025  2.6560  7.490
2 in.interval2(x, lo, hi)  1.035  1.4230  1.6800  2.0705 11.246
3 in.interval3(x, lo, hi) 25.439 26.2320 26.7350 27.2250 77.541
4 in.interval4(x, lo, hi) 22.479 23.3920 23.8395 24.3725 33.770
5 in.interval5(x, lo, hi)  1.425  1.8740  2.2980  2.5565 21.598


EDIT: Following other comments, here's an even faster Rcpp solution, using the tricks with absolute values given:
library( compiler )
library( inline )
library( microbenchmark )

in.interval.oldRcpp <- rcpp( 
  signature(x="numeric", lo="numeric", hi="numeric"), '
    NumericVector xx(x);
    double lower = Rcpp::as<double>(lo);
    double upper = Rcpp::as<double>(hi);

    return Rcpp::wrap( (xx > lower) & (xx < upper) );
    ')

in.interval.abs <- rcpp( 
  signature(x="numeric", lo="numeric", hi="numeric"), '
    NumericVector xx(x);
    double lower = as<double>(lo);
    double upper = as<double>(hi); 

    LogicalVector out(x);
    for( int i=0; i < out.size(); i++ ) {
      out[i] = ( (xx[i]-lower) * (xx[i]-upper) ) <= 0;
    }
    return wrap(out);
    ')

in.interval.abs.sugar <- rcpp( 
  signature( x="numeric", lo="numeric", hi="numeric"), '
    NumericVector xx(x);
    double lower = as<double>(lo);
    double upper = as<double>(hi); 

    return wrap( ((xx-lower) * (xx-upper)) <= 0 );
    ')

x <- runif(1E5)
lo <- 0.5
hi <- 1

microbenchmark(
  in.interval.oldRcpp(x, lo, hi),
  in.interval.abs(x, lo, hi),
  in.interval.abs.sugar(x, lo, hi)
)

all.equal( in.interval.oldRcpp(x, lo, hi), in.interval.abs(x, lo, hi) )
all.equal( in.interval.oldRcpp(x, lo, hi), in.interval.abs.sugar(x, lo, hi) )

gives me

1       in.interval.abs(x, lo, hi)  662.732  666.4855  669.939  690.6585 1580.707
2 in.interval.abs.sugar(x, lo, hi)  722.789  726.0920  728.795  742.6085 1671.093
3   in.interval.oldRcpp(x, lo, hi) 1870.784 1876.4890 1892.854 1935.0445 2859.025

> all.equal( in.interval.oldRcpp(x, lo, hi), in.interval.abs(x, lo, hi) )
[1] TRUE

> all.equal( in.interval.oldRcpp(x, lo, hi), in.interval.abs.sugar(x, lo, hi) )
[1] TRUE
like image 4
Kevin Ushey Avatar answered Nov 07 '22 03:11

Kevin Ushey