I have a long vector, thousands of entries, which has elements 0, 1, 2 in it sporadically. 0 means "no signal", 1 means "signal on", and 2 means "signal off". I am trying to find the runs from 1 to the next occurrence of 2 and fill the space with 1s. I also need to do the same thing between a 2 and the next occurrence of 1 but fill the space with 0s.
I currently have a solution for this issue using loops but it's slow and incredibly inefficient:
example vector:
exp = c(1,1,1,0,0,1,2,0,2,0,1,0,2)
desired result:
1,1,1,1,1,1,2,0,0,0,1,1,2
Thank you
You could use rle
& shift
from the data.table-package in the following way:
library(data.table)
# create the run-length object
rl <- rle(x)
# create indexes of the spots in the run-length object that need to be replaced
idx1 <- rl$values == 0 & shift(rl$values, fill = 0) == 1 & shift(rl$values, fill = 0, type = 'lead') %in% 1:2
idx0 <- rl$values == 2 & shift(rl$values, fill = 0) == 0 & shift(rl$values, fill = 2, type = 'lead') %in% 0:1
# replace these values
rl$values[idx1] <- 1
rl$values[idx0] <- 0
Now you will get the desired result by using inverse.rle
:
> inverse.rle(rl)
[1] 1 1 1 1 1 1 2 0 0 0 1 1 2
As an alternative for the shift
-function, you could also use the lag
and lead
functions from dplyr.
If you want to assess the speed of both approaches, the microbenchmark
-package is a useful tool. Below you'll find 3 benchmarks, each for a different vector size:
# create functions for both approaches
jaap <- function(x) {
rl <- rle(x)
idx1 <- rl$values == 0 & shift(rl$values, fill = 0) == 1 & shift(rl$values, fill = 0, type = 'lead') %in% 1:2
idx0 <- rl$values == 2 & shift(rl$values, fill = 0) == 0 & shift(rl$values, fill = 2, type = 'lead') %in% 0:1
rl$values[idx1] <- 1
rl$values[idx0] <- 0
inverse.rle(rl)
}
john <- function(x) {
Reduce(f, x, 0, accumulate = TRUE)[-1]
}
Execute the benchmarks:
# benchmark on the original data
> microbenchmark(jaap(x), john(x), times = 100)
Unit: microseconds
expr min lq mean median uq max neval cld
jaap(x) 58.766 61.2355 67.99861 63.8755 72.147 143.841 100 b
john(x) 13.684 14.3175 18.71585 15.7580 23.902 50.705 100 a
# benchmark on a somewhat larger vector
> x2 <- rep(x, 10)
> microbenchmark(jaap(x2), john(x2), times = 100)
Unit: microseconds
expr min lq mean median uq max neval cld
jaap(x2) 69.778 72.802 84.46945 76.9675 87.3015 184.666 100 a
john(x2) 116.858 121.058 127.64275 126.1615 130.4515 223.303 100 b
# benchmark on a very larger vector
> x3 <- rep(x, 1e6)
> microbenchmark(jaap(x3), john(x3), times = 20)
Unit: seconds
expr min lq mean median uq max neval cld
jaap(x3) 1.30326 1.337878 1.389187 1.391279 1.425186 1.556887 20 a
john(x3) 10.51349 10.616632 10.689535 10.670808 10.761191 10.918953 20 b
From this you can conclude that the rle
-approach has an advantage when applied to vectors that are larger than 100 elements (which is probably nearly always).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With