How can I efficiently find the index of a value in a sorted array?

Tags: r, bisection

I have a sorted array of values and a single value like so:

x <- c(1.0, 3.45, 5.23, 7.3, 12.5, 23.45)
v <- 6.45

I can find the index of the value after which v would be inserted into x while maintaining the sorting order:

max(which(x <= v))
[1] 3

This is nice, compact code, but I have a gut feeling that behind the scenes it is really inefficient: since which() does not know that the array is sorted, it has to inspect every value.

Is there a better way of finding this index value?

Note: I am not interested in actually merging v into x. I just want the index value.

asked Dec 06 '21 04:12

Patrick

3 Answers

If you need a faster version and don't need to validate your inputs, you can write a simple C++ function:

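Rcpp::cppFunction(
  "int foo(double x, const Rcpp::NumericVector& v)
  {
    // Bisection on the 0-based range [min, max); v must be sorted ascending.
    int min = 0;
    int max = v.size();
    while (max - min > 1)
    {
      int idx = (min + max) / 2;
      if (v[idx] > x)
      {
        max = idx;  // v[idx] is too large: discard the upper half
      }
      else
      {
        min = idx;  // v[idx] <= x: the answer is at idx or above
      }
    }
    return min + 1;  // 1-based R index (1 is also returned when x < v[0])
  }"
)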

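Note that foo returns 1 even when x < v[0], a case in which findInterval would return 0. If you need that case handled, add the check if (x < v[0]) yourself (I don't know what you want to see in this case). You can test the function with the microbenchmark package: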

library(microbenchmark)

n = 1e6
v = sort(rnorm(n, 0, 15))
x = runif(1, -15, 15)
microbenchmark(max(which(v <= x)), sum(v <= x), findInterval(x, v), foo(x, v))

Result:

[Benchmark plot: https://i.stack.imgur.com/ro5jr.png]
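
For readers who want to see the bisection without compiling anything, here is a minimal pure-R sketch of the same idea with the lower-bound case handled, so it returns 0 when x is smaller than every element. The name bisect_right is hypothetical and not part of any package, and an R-level loop like this should not be expected to match the compiled version for speed.

```r
# Hypothetical helper: number of elements of the sorted vector v that are <= x,
# i.e. the index after which x would be inserted.
bisect_right <- function(x, v) {
  lo <- 0L          # v[1..lo] are known to be <= x
  hi <- length(v)   # v[(hi + 1)..n] are known to be > x
  while (lo < hi) {
    mid <- (lo + hi + 1L) %/% 2L   # always satisfies lo < mid <= hi
    if (v[mid] > x) {
      hi <- mid - 1L               # v[mid] is too large: discard it and the rest
    } else {
      lo <- mid                    # v[mid] <= x: the answer is mid or higher
    }
  }
  lo
}

x <- c(1.0, 3.45, 5.23, 7.3, 12.5, 23.45)
inside <- bisect_right(6.45, x)   # 3, same as max(which(x <= 6.45))
below  <- bisect_right(0.5, x)    # 0, nothing in x is <= 0.5
above  <- bisect_right(100, x)    # 6, everything in x is <= 100
print(c(inside, below, above))
```

The loop keeps two counters instead of the 0-based min/max pair of the C++ version, which is what lets it report 0 for values below the first element.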

answered Oct 26 '22 11:10

Егор Шишунов

Benchmark based on Егор Шишунов's answer:

# Functions:
Rcpp::cppFunction(
  "int Erop(double x, const Rcpp::NumericVector& v)
  {
    int min = 0;
    int max = v.size();
    while (max - min > 1)
    {
      int idx = (min + max) / 2;
      if (v[idx] > x)
      {
        max = idx;
      }
      else
      {
        min = idx;
      }
    }
    return min + 1;
  }"
)

Rcpp::cppFunction(
  "int GKi(double v, const Rcpp::NumericVector& x) {
     return std::distance(x.begin(), std::upper_bound(x.begin(), x.end(), v));
}")

Rcpp::cppFunction("
Rcpp::IntegerVector GKi2(const Rcpp::NumericVector& v 
                       , const Rcpp::NumericVector& x) {
  Rcpp::IntegerVector res(v.length());
  for(int i=0; i < res.length(); ++i) {
    res[i] = std::distance(x.begin(), std::upper_bound(x.begin(), x.end(), v[i]));
  }
  return res;
}")

# Data:
set.seed(42)
x <- sort(rnorm(1e6))
v <- sort(c(sample(x, 15), rnorm(15)))

# Result:
bench::mark(whichMax= sapply(v, \(v) max(which(x <= v)))
          , sum = sapply(v, \(v) sum(x<=v))
          , findInterval = findInterval(v, x)
          , Erop = sapply(v, \(v) Erop(v, x))
          , GKi = sapply(v, \(v) GKi(v, x))
          , GKi2 = GKi2(v, x)
)
#  expression        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 whichMax      92.03ms 102.32ms      9.15        NA    102.      5    56
#2 sum           74.91ms  77.84ms     12.0         NA     37.9     6    19
#3 findInterval 680.41µs 755.61µs   1263.          NA      0     632     0
#4 Erop          57.19µs  62.13µs  12868.          NA     24.0  6432    12
#5 GKi           54.53µs   60.4µs  13316.          NA     24.0  6657    12
#6 GKi2           2.02µs   2.38µs 386027.          NA      0   10000     0
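
Timings are only comparable if the candidates return the same indices, so a quick agreement check before benchmarking seems worthwhile. The sketch below covers only the base-R candidates, on a smaller vector than the benchmark uses. The compiled functions could be added to the comparison in the same way, bearing in mind that Erop returns 1 rather than 0 for a value below x[1].

```r
set.seed(42)
x <- sort(rnorm(1e4))                    # smaller than the benchmark data
v <- sort(c(sample(x, 15), rnorm(15)))   # exact hits plus values in between

by_which    <- sapply(v, function(vi) suppressWarnings(max(which(x <= vi))))
by_sum      <- sapply(v, function(vi) sum(x <= vi))
by_interval <- findInterval(v, x)

# max(which()) yields -Inf for a value below x[1]; sum() and findInterval()
# yield 0 there, so map -Inf to 0 before comparing.
by_which[!is.finite(by_which)] <- 0

agree <- all(by_which == by_sum) && all(by_sum == by_interval)
print(agree)
```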

answered Oct 26 '22 11:10

7 revs, 2 users 91%

You can use findInterval, which uses a binary search.

findInterval(v, x)
#[1] 3

Or use C++ std::upper_bound via Rcpp.

Rcpp::cppFunction(
  "int upper_bound(double v, const Rcpp::NumericVector& x) {
     return std::distance(x.begin(), std::upper_bound(x.begin(), x.end(), v));
}")

upper_bound(v, x)
#[1] 3

Or, if you have a whole vector of values to look up, as findInterval accepts:

Rcpp::cppFunction("
Rcpp::IntegerVector upper_bound2(const Rcpp::NumericVector& v
                               , const Rcpp::NumericVector& x) {
  Rcpp::IntegerVector res(v.length());
  for(int i=0; i < res.length(); ++i) {
    res[i] = std::distance(x.begin(), std::upper_bound(x.begin(), x.end(), v[i]));
  }
  return res;
}")

v <- c(3, 6.45)
upper_bound2(v, x)
#[1] 1 3
findInterval(v, x)
#[1] 1 3
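
To make the edge behaviour concrete, here is a small base-R sketch using the x from the question. It looks up values below the first element, equal to an element, between elements, and above the last element, and compares the result with the brute-force expressions. The lookup values are made up for illustration.

```r
x <- c(1.0, 3.45, 5.23, 7.3, 12.5, 23.45)

# Values below x[1], equal to elements of x, between elements, and above x[6].
v <- c(0.5, 1.0, 6.45, 7.3, 100)

idx <- findInterval(v, x)
print(idx)
# 0 1 3 4 6

# Same counts as the brute-force sum(x <= v) for each value.
stopifnot(all(idx == sapply(v, function(vi) sum(x <= vi))))

# max(which()) has nothing to take the maximum of when v < x[1]:
# it warns and returns -Inf, whereas findInterval() returns 0.
below <- suppressWarnings(max(which(x <= 0.5)))
print(below)

# left.open = TRUE makes a tie fall into the interval on its left,
# so a value equal to x[4] maps to 3 instead of 4.
tie <- findInterval(7.3, x, left.open = TRUE)
print(tie)
```

So findInterval(v, x) is a drop-in replacement for max(which(x <= v)) whenever v is at least x[1], and it gives a usable 0 instead of -Inf otherwise.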

answered Oct 26 '22 11:10

GKi
