Say we have a couple of vectors:
a <- c(1, 2, 2, 4, 7)
b <- c(1, 2, 3, 5, 7)
For each element b[i] in b, I want to find the number of elements in a that are less than b[i]; equivalently, I want to know the rank of b[i] in c(b[i], a).
There are a couple of naive ways I can think of, e.g. doing either of the following length(b) times (min_rank is from dplyr; see the sketch below):
min_rank(c(b[i], a))
sum(a < b[i])
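For concreteness, here is the second form spelled out as a full loop over b for the small example above (my sketch, not part of the original question):
sapply(seq_along(b), function(i) sum(a < b[i]))
# [1] 0 1 3 4 4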
What's the best way to do this if length(a) = length(b) = N, where N is large?
EDIT: To clarify, I'm wondering if there's a more computationally efficient way to do this, i.e. whether I can do better than quadratic time in this case.
Vectorization is always cool though ;), thanks @Henrik!
Running time
a <- rpois(100000, 20)
b <- rpois(100000, 10)
# naive quadratic approach
system.time(
  result1 <- sapply(b, function(x) sum(a < x))
)
# user system elapsed
# 71.15 0.00 71.16
# sort/match approach
sw <- proc.time()
bu <- sort(unique(b))
ab <- sort(c(a, bu))
ind <- match(bu, ab)
nbelow <- ind - 1:length(bu)
result2 <- sapply(b, function(x) nbelow[match(x, bu)])
proc.time() - sw
# user system elapsed
# 0.46 0.00 0.48
# findInterval approach
sw <- proc.time()
a1 <- sort(a)
result3 <- findInterval(b - sqrt(.Machine$double.eps), a1)
proc.time() - sw
# user system elapsed
# 0.00 0.00 0.03
identical(result1, result2) && identical(result2, result3)
# [1] TRUE
Assuming that a is sorted in weakly increasing order, use findInterval:
a <- sort(a)
## gives points less than or equal to b[i]
findInterval(b, a)
# [1] 1 3 3 4 5
## to do strictly less than, subtract a small amount from b
## (here sqrt(.Machine$double.eps), a tolerance based on machine precision)
findInterval(b - sqrt(.Machine$double.eps), a)
# [1] 0 1 3 4 4
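As a quick sanity check (my addition, not part of the original answer), the strict version reproduces the naive count on the example vectors:
identical(sapply(b, function(x) sum(a < x)),
          findInterval(b - sqrt(.Machine$double.eps), a))
# [1] TRUE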
If you're really optimising this process for large N, then you may want to remove duplicate values in b, at least initially, and then you can sort and match:
bu <- sort(unique(b))
ab <- sort(c(a, bu))
ind <- match(bu, ab)
nbelow <- ind - 1:length(bu)
Since we've merged the a and b values into ab, the index returned by match counts all values of a below the given value of b plus all unique b values at or below it, which is why the final line subtracts the running count 1:length(bu). I suspect this may be faster for large sets; it should be if match is internally optimised for sorted vectors, which one would hope to be the case. It should then be a trivial matter to map nbelow back to your original set of b values, as sketched below.
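For instance, that final mapping can be a single vectorised match against bu (a small sketch of the step described above; result is just an illustrative name):
result <- nbelow[match(b, bu)]  # count of a values strictly below each original b[i]
result
# [1] 0 1 3 4 4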