Let be a sequence of TRUE and FALSE in R
v = c(F,F,F,F,F,F,T,F,T,T,F,T,T,T,T,T,F,T,F,T,T,F,F,F,T,F,F,F,F,F)
I would like to get the the positions of the first and the last TRUE. One way to achieve this is
range(which(v)) # 7 25
but this solution is relatively slow as it must check every element of the vector to get the position of each TRUE and then loop over all positions, evaluating two if
statements at each position (I think) in order to get the maximum and the minimum values. It would be much more strategic to search for the first TRUE starting one from the beginning and one from the end and just return those positions.
Is there a faster alternative to range(which(..))
?
The simplest approach I can think of that doesn't involve searching the entire vector would be an Rcpp solution:
library(Rcpp)
cppFunction(
"NumericVector rangeWhich(LogicalVector x) {
NumericVector ret(2, NumericVector::get_na());
int n = x.size();
for (int idx=0; idx < n; ++idx) {
if (x[idx]) {
ret[0] = idx+1; // 1-indexed for R
break;
}
}
if (R_IsNA(ret[0])) return ret; // No true values
for (int idx=n-1; idx >= 0; --idx) {
if (x[idx]) {
ret[1] = idx + 1; // 1-indexed for R
break;
}
}
return ret;
}")
rangeWhich(v)
# [1] 7 25
We can benchmark on a fairly long vector (length 1 million) with random entries. We would expect to get pretty large efficiency gains from not searching through the whole thing with which
:
set.seed(144)
bigv <- sample(c(F, T), 1000000, replace=T)
library(microbenchmark)
# range_find from @PierreLafortune
range_find <- function(v) {
i <- 1
while(!v[i]) {
i <- i +1
}
j <- length(v)
while(!v[j]) {
j <- j-1
}
c(i,j)
}
# shortCircuit from @JoshuaUlrich
shortCircuit <- compiler::cmpfun({
function(x) {
first <- 1
while(TRUE) if(x[first]) break else first <- first+1
last <- length(x)
while(TRUE) if(x[last]) break else last <- last-1
c(first, last)
}
})
microbenchmark(rangeWhich(bigv), range_find(bigv), shortCircuit(bigv), range(which(bigv)))
# Unit: microseconds
# expr min lq mean median uq max neval
# rangeWhich(bigv) 1.476 2.4655 9.45051 9.0640 13.7585 46.286 100
# range_find(bigv) 1.445 2.2930 8.06993 7.2055 11.8980 26.893 100
# shortCircuit(bigv) 1.114 1.6920 7.30925 7.0440 10.2210 30.758 100
# range(which(bigv)) 6821.180 9389.1465 13991.84613 10007.9045 16698.2230 58112.490 100
The Rcpp solution is a good deal faster (more than 500x faster) than max(which(v))
because it doesn't need to iterate through the whole vector with which
. For this example it has a near-identical runtime (in fact, slightly slower) than range_find
from @PierreLafortune and shortCircuit
from @JoshuaUlrich.
Using Joshua's excellent example of some worst-case behavior where the true value is in the very middle of the vector (I'm repeating his experiment with all proposed functions so we can see the whole picture), we see a very different situation:
bigv2 <- rep(FALSE, 1e6)
bigv2[5e5-1] <- TRUE
bigv2[5e5+1] <- TRUE
microbenchmark(rangeWhich(bigv2), range_find(bigv2), shortCircuit(bigv2), range(which(bigv2)))
# Unit: microseconds
# expr min lq mean median uq max neval
# rangeWhich(bigv2) 546.206 555.3820 593.1385 575.3790 599.055 979.924 100
# range_find(bigv2) 400057.083 406449.0075 434515.1142 411881.4145 427487.041 697529.163 100
# shortCircuit(bigv2) 74942.612 75663.7835 79095.3795 76761.5325 79703.265 125054.360 100
# range(which(bigv2)) 632.086 679.0955 761.9610 700.1365 746.509 3924.941 100
For this vector the looping base R solutions are much slower than the original solution (100-600x slower) and the Rcpp solution is barely faster than range(which(bigv2))
(which makes sense, because they're both looping through the whole vector once).
As usual, this needs to come with a disclaimer -- you need to compile your Rcpp function, which also takes time, so this will only be a benefit if you have very large vectors or are repeating this operation many times. From the comments on your question it sounds like you indeed have a large number of large vectors, so this could be a good option for you.
match
is quick as it stops when it finds the value searched for:
c(match(T,v),length(v)-match(T,rev(v))+1)
[1] 7 25
But you would have to test the speeds.
Update:
range_find <- function(v) {
i <- 1
j <- length(v)
while(!v[i]) {
i <- i+1
}
while(!v[j]) {
j <- j-1
}
c(i,j)
}
Benchmark
v <- rep(v, 5e4)
microbenchmark(
rangeWhich = rangeWhich(v),
range_find = range_find(v),
richwhich = {w <- which(v)
w[c(1L, length(w))]},
match = c(match(T,v),length(v)-match(T,rev(v))+1)
)
Unit: microseconds
expr min lq mean median uq max neval
rangeWhich 1.284 3.2090 16.50914 20.211 26.7875 29.836 100
range_find 9.945 21.4945 32.02652 26.948 34.1660 144.042 100
richwhich 2941.756 3022.5975 3243.02081 3130.227 3247.6405 5403.911 100
match 45696.329 46771.8175 50662.45708 47359.526 48718.6055 131439.661 100
This approach matches your proposed strategy:
"It would be much more strategic to search for the first TRUE starting one from the beginning and one from the end and just return those positions."
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With