Optimal method of comparing a vector of numbers to values in another vector

Tags: r, vector

Suppose I have two vectors of values:

a <- c(1,3,4,5,6,7,3)
b <- c(3,5,1,3,2)

I want to apply some function, FUN, to each element of a against the whole of b. What's the most efficient way to do it?

More specifically, in this case, for each value in a I want to know how many of the elements in b are less than that value. The naïve approach is to do the following:

sum(a < b)

Of course, this doesn't work: it compares the two vectors element by element, recycling the shorter one, and gives me the warning:

longer object length is not a multiple of shorter object length

The output, btw, of that command is 3.
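To see where that 3 comes from, here is a minimal sketch of the recycling R performs before comparing. The names b_recycled and recycled_cmp are introduced only for illustration.

```r
a <- c(1, 3, 4, 5, 6, 7, 3)
b <- c(3, 5, 1, 3, 2)

# R recycles the shorter vector b up to the length of a before comparing.
# The warning appears because 7 is not a multiple of 5.
b_recycled <- rep(b, length.out = length(a))   # 3 5 1 3 2 3 5

# Element-wise comparison of a against the recycled b.
recycled_cmp <- a < b_recycled   # TRUE TRUE FALSE FALSE FALSE FALSE TRUE

sum(recycled_cmp)   # 3, the same value that sum(a < b) returns
```

So the naïve call compares a[i] only with a single (recycled) element of b, never with the whole of b.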

However, in my situation, what I'd like to see is an output that is:

0 2 4 4 5 5 2

Of course, I realize I can do it using a for loop as such:

out <- c()
for (i in a) {
    # count the elements of b that are strictly less than the current value of a
    out[length(out) + 1] <- sum(b < i)
}

Likewise, I could use sapply as such:

sapply(a, function(x)sum(b<x))

However, I'm trying to be a good R programmer and stay away from for loops, and sapply seems to be very slow. Are there other alternatives?

For what it's worth, I'm doing this a couple of million times where length(b) is always less than length(a) and length(a) ranges from 1 to 30.

asked Feb 24 '11 20:02

Pridkett

1 Answer

Try this:

findInterval(a - 0.5, sort(b))
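A short sketch of why this works: for each x, findInterval(x, vec) returns the number of elements of the sorted vec that are less than or equal to x. Subtracting 0.5 turns that "less than or equal" into "strictly less than", but only when the data are integer-valued, as they are here. For non-integer data, one possible alternative (an assumption on my part, not something from the answer) is the left.open = TRUE argument, which is available in R 3.3.0 and later. The names sb, counts_shift, counts_open and counts_ref are illustrative.

```r
a <- c(1, 3, 4, 5, 6, 7, 3)
b <- c(3, 5, 1, 3, 2)
sb <- sort(b)   # 1 2 3 3 5

# Number of elements of b that are <= (a - 0.5); for integer-valued data
# this equals the number of elements of b that are < a.
counts_shift <- findInterval(a - 0.5, sb)   # 0 2 4 4 5 5 2

# Assumed alternative (R >= 3.3.0): intervals open on the left, so the
# returned index is the number of elements strictly below each value of a.
counts_open <- findInterval(a, sb, left.open = TRUE)

# Reference result from the question's sapply approach.
counts_ref <- sapply(a, function(x) sum(b < x))

stopifnot(all(counts_shift == counts_ref), all(counts_open == counts_ref))
```

Both calls sort b once and then do a binary search per element of a, which is why this approach scales well as the vectors grow.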

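Further speed improvements come from (a) avoiding sort, and (b) avoiding the overhead in findInterval and order by using simpler .Internal wrappers. These wrappers call unexported internals, so they are tied to the R version they were written for: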

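# Thin wrapper around the internal order routine (na.last = TRUE, decreasing = FALSE),
# skipping the argument handling that order() does.
order2 = function(x) .Internal(order(TRUE, FALSE, x))

# Calls the compiled routine behind findInterval directly, skipping its checks.
# vec must already be sorted; DUP = FALSE lets the routine fill index in place.
findInterval2 = function(x, vec, rightmost.closed=FALSE, all.inside=FALSE) {
  nx <- length(x)
  index <- integer(nx)
  .C('find_interv_vec', xt=as.double(vec), n=length(vec),
    x=as.double(x), nx=nx, as.logical(rightmost.closed),
    as.logical(all.inside), index, DUP = FALSE, NAOK=TRUE,
    PACKAGE='base')
  index
}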

> system.time(for (i in 1:10000) findInterval(a - 0.5, sort(b)))
   user  system elapsed 
   1.22    0.00    1.22 
> system.time(for (i in 1:10000) sapply(a, function(x)sum(b<x)))
   user  system elapsed 
   0.79    0.00    0.78 
> system.time(for (i in 1:10000) rowSums(outer(a, b, ">")))
   user  system elapsed 
   0.72    0.00    0.72 
> system.time(for (i in 1:10000) findInterval(a - 0.5, b[order(b)]))
   user  system elapsed 
   0.42    0.00    0.42 
> system.time(for (i in 1:10000) findInterval2(a - 0.5, b[order2(b)]))
   user  system elapsed 
   0.16    0.00    0.15

The complexity of defining findInterval2 and order2 is probably only warranted if you have heaps of iterations with fairly small N.

Also timings for larger N:

> a = rep(a, 100)
> b = rep(b, 100)
> system.time(for (i in 1:100) findInterval(a - 0.5, sort(b)))
   user  system elapsed 
   0.01    0.00    0.02 
> system.time(for (i in 1:100) sapply(a, function(x)sum(b<x)))
   user  system elapsed 
   0.67    0.00    0.68 
> system.time(for (i in 1:100) rowSums(outer(a, b, ">")))
   user  system elapsed 
   3.67    0.26    3.94 
> system.time(for (i in 1:100) findInterval(a - 0.5, b[order(b)]))
   user  system elapsed 
      0       0       0 
> system.time(for (i in 1:100) findInterval2(a - 0.5, b[order2(b)]))
   user  system elapsed 
      0       0       0
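Before comparing timings, it is worth confirming that the approaches agree. The following is a minimal sketch using the question's original a and b (not the repeated versions above); the res_* names are illustrative, and the findInterval variant again assumes integer-valued data.

```r
a <- c(1, 3, 4, 5, 6, 7, 3)
b <- c(3, 5, 1, 3, 2)

# One pass over b per element of a.
res_sapply <- sapply(a, function(x) sum(b < x))

# Builds a length(a) x length(b) logical matrix, then counts TRUEs per row.
res_outer <- rowSums(outer(a, b, ">"))

# Sorts b once, then locates each shifted value of a by binary search.
res_find <- findInterval(a - 0.5, b[order(b)])

stopifnot(all(res_sapply == res_outer), all(res_sapply == res_find))
res_sapply   # 0 2 4 4 5 5 2
```

The outer version allocates the full length(a) x length(b) matrix, which is consistent with it being the slowest of the three in the larger-N timings.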

answered Nov 15 '22 06:11

Charles
