Suppose I have two vectors of values:
a <- c(1,3,4,5,6,7,3)
b <- c(3,5,1,3,2)
And I want to apply some function, FUN, to each of the inputs of a against the whole of b. What's the most efficient way to do it?
More specifically, for each of the elements in a, I want to know how many of the elements in b are less than that value. The naïve approach is to do the following:
sum(a < b)
Of course, this doesn't work as it attempts to iterate over each of the vectors in parallel and gives me the warning:
longer object length is not a multiple of shorter object length
The output, btw, of that command is 3.
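To see where that 3 comes from, the recycling can be written out by hand. This is only a minimal sketch; the names b_recycled, elementwise and recycled_sum are introduced here for illustration.

```r
a <- c(1, 3, 4, 5, 6, 7, 3)
b <- c(3, 5, 1, 3, 2)

# R recycles the shorter vector to the length of the longer one,
# so a < b compares element i of a with element i of the recycled b.
b_recycled <- rep_len(b, length(a))   # 3 5 1 3 2 3 5

# Element-wise comparison: TRUE TRUE FALSE FALSE FALSE FALSE TRUE
elementwise <- a < b_recycled

# Same total as sum(a < b), but without the recycling warning
recycled_sum <- sum(elementwise)      # 3
```

So the single number is a count of position-by-position comparisons, not a count per element of a.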
However, in my situation, what I'd like to see is an output that is:
0 2 4 4 5 5 2
Of course, I realize I can do it using a for loop as such:
out <- c()
for (i in a) {
  out[length(out) + 1] = sum(b < i)
}
Likewise, I could use sapply as such:
sapply(a, function(x)sum(b<x))
However, I'm trying to be a good R programmer and stay away from for loops, and sapply seems to be very slow. Are there other alternatives?
For what it's worth, I'm doing this a couple of million times, where length(b) is always less than length(a) and length(a) ranges from 1 to 30.
Try this:
findInterval(a - 0.5, sort(b))
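Why this works: for each x, findInterval(x, vec) returns how many elements of the sorted vec are less than or equal to x. Subtracting 0.5 turns that into a strict "less than" count, but only because the example values are whole numbers. The sketch below checks this against the sapply version and, as an assumption about your R version, also tries the left.open = TRUE argument, which should avoid the offset for non-integer data. The variable names are illustrative.

```r
a <- c(1, 3, 4, 5, 6, 7, 3)
b <- c(3, 5, 1, 3, 2)
sb <- sort(b)                          # 1 2 3 3 5

# findInterval(x, vec) counts the elements of sorted vec that are <= x.
# The 0.5 shift turns "<=" into "<" only for whole-number data.
shifted <- findInterval(a - 0.5, sb)   # 0 2 4 4 5 5 2

# Assumption: this R version supports left.open. With left.open = TRUE the
# intervals are open on the left, so the count is of elements strictly
# below x and no offset is needed.
strict <- findInterval(a, sb, left.open = TRUE)

# Reference result from the question's sapply approach
reference <- sapply(a, function(x) sum(b < x))
```

If the data can contain non-integers, the left.open form is the safer of the two, since a fixed 0.5 offset could step over values that lie between x - 0.5 and x.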
Speed improvement from a) avoiding sort, and b) avoiding overhead in findInterval and order by using simpler .Internal wrappers:
order2 = function(x) .Internal(order(T, F, x))
findInterval2 = function(x, vec, rightmost.closed=F, all.inside=F) {
  nx <- length(x)
  index <- integer(nx)
  # DUP = FALSE lets the C routine write into index in place,
  # which is why the .C() result is not assigned
  .C('find_interv_vec', xt=as.double(vec), n=length(vec),
     x=as.double(x), nx=nx, as.logical(rightmost.closed),
     as.logical(all.inside), index, DUP = FALSE, NAOK=T,
     PACKAGE='base')
  index
}
> system.time(for (i in 1:10000) findInterval(a - 0.5, sort(b)))
user system elapsed
1.22 0.00 1.22
> system.time(for (i in 1:10000) sapply(a, function(x)sum(b<x)))
user system elapsed
0.79 0.00 0.78
> system.time(for (i in 1:10000) rowSums(outer(a, b, ">")))
user system elapsed
0.72 0.00 0.72
> system.time(for (i in 1:10000) findInterval(a - 0.5, b[order(b)]))
user system elapsed
0.42 0.00 0.42
> system.time(for (i in 1:10000) findInterval2(a - 0.5, b[order2(b)]))
user system elapsed
0.16 0.00 0.15
The complexity of defining findInterval2 and order2 is probably only warranted if you have heaps of iterations with fairly small N.
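The rowSums(outer(a, b, ">")) variant in the timings is not explained above, so here is a small sketch of what it does, along with a check that the portable approaches agree on the example data. The variable names are made up for illustration; findInterval2 and order2 are left out because they depend on R internals.

```r
a <- c(1, 3, 4, 5, 6, 7, 3)
b <- c(3, 5, 1, 3, 2)

# outer() builds a length(a) x length(b) logical matrix:
# cell [i, j] is TRUE when a[i] > b[j]
cmp <- outer(a, b, ">")

# Each row sum counts how many elements of b lie below a[i]
by_outer <- rowSums(cmp)

by_sapply   <- sapply(a, function(x) sum(b < x))
by_interval <- findInterval(a - 0.5, b[order(b)])

# Compare values only: rowSums() returns doubles, the other two integers
all_agree <- all(by_outer == by_sapply) && all(by_sapply == by_interval)
```

A check like this is worth running on your own data before switching approaches, since the 0.5 offset assumes whole-number values.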
Also timings for larger N:
> a = rep(a, 100)
> b = rep(b, 100)
> system.time(for (i in 1:100) findInterval(a - 0.5, sort(b)))
user system elapsed
0.01 0.00 0.02
> system.time(for (i in 1:100) sapply(a, function(x)sum(b<x)))
user system elapsed
0.67 0.00 0.68
> system.time(for (i in 1:100) rowSums(outer(a, b, ">")))
user system elapsed
3.67 0.26 3.94
> system.time(for (i in 1:100) findInterval(a - 0.5, b[order(b)]))
user system elapsed
0 0 0
> system.time(for (i in 1:100) findInterval2(a - 0.5, b[order2(b)]))
user system elapsed
0 0 0
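One plausible reason outer falls behind at this size is that it materialises every pairwise comparison, while findInterval needs only a sorted copy of b and one lookup per element of a. A rough sketch of the sizes involved follows; a_big, b_big and the other names are illustrative, and this has not been profiled.

```r
a_big <- rep(c(1, 3, 4, 5, 6, 7, 3), 100)   # length 700
b_big <- rep(c(3, 5, 1, 3, 2), 100)         # length 500

# outer() allocates one logical cell per (a, b) pair
n_cells <- length(a_big) * length(b_big)     # 350000
cmp_big <- outer(a_big, b_big, ">")

# findInterval works from the sorted b and returns one count per element of a
res_big <- findInterval(a_big - 0.5, sort(b_big))
```

Both routes give the same counts here; the difference is in how much intermediate data each one creates.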