
Optimal method of comparing a vector of numbers to values in another vector

Tags: r, vector

Suppose I have two vectors of values:

a <- c(1,3,4,5,6,7,3)
b <- c(3,5,1,3,2)

I want to apply some function, FUN, to each element of a against the whole of b. What is the most efficient way to do it?

More specifically, for each value in a, I want to know how many of the elements in b are strictly less than that value. The naïve approach is to compare the two vectors directly:

sum(a < b)

Of course, this doesn't work: R compares the two vectors element by element, recycling the shorter one, and gives me the warning:

longer object length is not a multiple of shorter object length

The output, btw, of that command is 3.
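To see where the 3 comes from, the recycling can be written out by hand. This is only an illustrative sketch; the names `b_recycled` and `naive_count` are made up for the example, and `rep_len` is used to mimic what the comparison does implicitly.

```r
a <- c(1, 3, 4, 5, 6, 7, 3)
b <- c(3, 5, 1, 3, 2)

# Repeat the shorter vector until it is as long as the longer one
b_recycled <- rep_len(b, length(a))   # 3 5 1 3 2 3 5

# Position-by-position comparison: a[1] vs b[1], a[2] vs b[2], ...
pairwise <- a < b_recycled

# Same count as sum(a < b), but without the recycling warning
naive_count <- sum(pairwise)
```

Each element of a is compared with exactly one element of b, which is why the result is a single number rather than one count per element of a.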

However, in my situation, what I'd like to see is an output that is:

0 2 4 4 5 5 2

Of course, I realize I can do it using a for loop as such:

out <- c()
for (i in a) {
    out[length(out) + 1] <- sum(b < i)
}

Likewise, I could use sapply as such:

sapply(a, function(x)sum(b<x))
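For reference, a type-stable variant of the same idea can be sketched with `vapply`, which declares up front that every call returns a single integer. This is not claimed to be faster than `sapply`; the name `counts` is just for illustration.

```r
a <- c(1, 3, 4, 5, 6, 7, 3)
b <- c(3, 5, 1, 3, 2)

# For each x in a, count the elements of b strictly below x.
# integer(1) tells vapply to expect exactly one integer per call.
counts <- vapply(a, function(x) sum(b < x), integer(1))
```

On these inputs the result matches the desired output, 0 2 4 4 5 5 2.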

However, I'm trying to be a good R programmer and stay away from for loops, and sapply seems to be very slow. Are there other alternatives?

For what it's worth, I'm doing this a couple of million times, where length(b) is always less than length(a) and length(a) ranges from 1 to 30.

asked Feb 24 '11 at 20:02 by Pridkett




1 Answer

Try this:

findInterval(a - 0.5, sort(b))
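Why this works: with its default arguments, `findInterval(x, v)` returns, for each element of x, how many elements of the sorted vector v are less than or equal to it. Subtracting 0.5 turns that into a strict "less than" count, but only under the assumption that the data are integer-valued, as in the question. The sketch below checks the result against the `sapply` version and also shows the `left.open = TRUE` argument, which assumes R 3.3.0 or later and does not rely on the 0.5 shift. The variable names are illustrative.

```r
a <- c(1, 3, 4, 5, 6, 7, 3)
b <- c(3, 5, 1, 3, 2)
b_sorted <- sort(b)                      # findInterval needs a sorted vector

# Count of b <= (a - 0.5); equals count of b < a for integer-valued data
shifted <- findInterval(a - 0.5, b_sorted)

# Assumes R >= 3.3.0: intervals open on the left, so this counts b < a directly
strict <- findInterval(a, b_sorted, left.open = TRUE)

# Reference result from the question's sapply approach
reference <- sapply(a, function(x) sum(b < x))
```

All three vectors agree on these inputs. The `left.open` form may be the safer choice when the values are not integers.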

Further speed improvement comes from a) replacing sort(b) with b[order(b)], and b) avoiding the overhead in findInterval and order by using thinner wrappers that call the underlying routines directly:

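# Caveat: both wrappers depend on R internals as they were at the time of
# writing (an internal C routine and DUP = FALSE) and may not work in later
# R versions.

# order() without argument checking: na.last = TRUE, decreasing = FALSE
order2 = function(x) .Internal(order(TRUE, FALSE, x))

# findInterval() without input validation; vec must already be sorted
findInterval2 = function(x, vec, rightmost.closed=FALSE, all.inside=FALSE) {
  nx <- length(x)
  index <- integer(nx)
  # DUP = FALSE lets the C routine fill `index` in place
  .C('find_interv_vec', xt=as.double(vec), n=length(vec),
    x=as.double(x), nx=nx, as.logical(rightmost.closed),
    as.logical(all.inside), index, DUP = FALSE, NAOK=TRUE,
    PACKAGE='base')
  index
}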

> system.time(for (i in 1:10000) findInterval(a - 0.5, sort(b)))
   user  system elapsed 
   1.22    0.00    1.22 
> system.time(for (i in 1:10000) sapply(a, function(x)sum(b<x)))
   user  system elapsed 
   0.79    0.00    0.78 
> system.time(for (i in 1:10000) rowSums(outer(a, b, ">")))
   user  system elapsed 
   0.72    0.00    0.72 
> system.time(for (i in 1:10000) findInterval(a - 0.5, b[order(b)]))
   user  system elapsed 
   0.42    0.00    0.42 
> system.time(for (i in 1:10000) findInterval2(a - 0.5, b[order2(b)]))
   user  system elapsed 
   0.16    0.00    0.15 

The complexity of defining findInterval2 and order2 is probably only warranted if you have heaps of iterations with fairly small N.
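One of the timed variants, `rowSums(outer(a, b, ">"))`, is not explained above. As a rough sketch of what it does: `outer` builds the full comparison matrix, and the row sums collapse it to one count per element of a. The names `cmp` and `counts` are illustrative.

```r
a <- c(1, 3, 4, 5, 6, 7, 3)
b <- c(3, 5, 1, 3, 2)

# length(a) x length(b) logical matrix: cmp[i, j] is TRUE when a[i] > b[j]
cmp <- outer(a, b, ">")

# Row i counts how many elements of b are strictly below a[i]
counts <- rowSums(cmp)
```

Because the matrix has length(a) * length(b) entries, this approach is convenient for short vectors but grows quadratically in memory and work as the vectors get longer.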

Also timings for larger N:

> a = rep(a, 100)
> b = rep(b, 100)
> system.time(for (i in 1:100) findInterval(a - 0.5, sort(b)))
   user  system elapsed 
   0.01    0.00    0.02 
> system.time(for (i in 1:100) sapply(a, function(x)sum(b<x)))
   user  system elapsed 
   0.67    0.00    0.68 
> system.time(for (i in 1:100) rowSums(outer(a, b, ">")))
   user  system elapsed 
   3.67    0.26    3.94 
> system.time(for (i in 1:100) findInterval(a - 0.5, b[order(b)]))
   user  system elapsed 
      0       0       0 
> system.time(for (i in 1:100) findInterval2(a - 0.5, b[order2(b)]))
   user  system elapsed 
      0       0       0 
answered Nov 15 '22 at 06:11 by Charles