I want to exclude values from a vector according to a set of intervals.
Example data:
mydata <- sort(runif(100,0,300))
mIntervals <- data.frame(start = c(2,50,97,159) , end = c(5,75, 120, 160))
Solution 1: a simple subset() call. Not suitable, because mIntervals may contain a large number of intervals.
Solution 2: nested for loops:
# Start with every value marked valid, then flag those inside any interval
valid <- rep(TRUE, length(mydata))
for (i in seq_along(mydata)) {
  for (j in seq_len(nrow(mIntervals))) {
    if (mydata[i] > mIntervals$start[j] && mydata[i] < mIntervals$end[j]) {
      valid[i] <- FALSE
    }
  }
}
mydata[valid]
This solution takes too long in R.
Solution 3: the findInterval() function:
require(FSA)
valid <- findInterval(mydata, sort(c(mIntervals$start, mIntervals$end)))
mydata[is.even(valid)]
Solution 4: somehow use the 'Intervals' package, but I found no suitable function there either (maybe interval_overlap()).
A quite similar (but not identical) issue was already discussed here, but those solutions are for vectors of integers, not for a continuous variable.
I have no more ideas. Solution 3 seems to be the best, but I don't like it. It is not robust: you would have to check for overlapping intervals, etc.
Is there a better solution to this very simple-looking problem? Thanks.
Real data: I have light intensity measured at some times (datetime, intensity). I also have intervals of datetime where the measuring device was under maintenance (start, end). Now I want to clean data = exclude values measured during maintenance periods (efficiently!).
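To make the real-data case concrete, here is a minimal sketch of the result I am after on a toy version of it. The data frames `readings` and `maintenance`, their column names, and all the values are made up for illustration. It loops over the intervals only (each comparison is vectorised over the readings), so it still does not scale well when there are many intervals.

```r
# Hypothetical toy data: one reading every 10 minutes (names and values are assumptions)
readings <- data.frame(
  datetime  = as.POSIXct("2016-01-01 00:00:00", tz = "UTC") +
    seq(0, by = 600, length.out = 12),
  intensity = c(10, 12, 11, 0, 0, 13, 14, 0, 15, 16, 17, 18)
)
maintenance <- data.frame(
  start = as.POSIXct(c("2016-01-01 00:25:00", "2016-01-01 01:05:00"), tz = "UTC"),
  end   = as.POSIXct(c("2016-01-01 00:45:00", "2016-01-01 01:15:00"), tz = "UTC")
)

# Drop every reading that falls strictly inside a maintenance window
keep <- rep(TRUE, nrow(readings))
for (j in seq_len(nrow(maintenance))) {
  keep <- keep & !(readings$datetime > maintenance$start[j] &
                   readings$datetime < maintenance$end[j])
}
cleaned <- readings[keep, ]
```

POSIXct values compare like numbers, so anything that works for the numeric example above should carry over to datetimes unchanged.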
Using the development version (1.9.7) of data.table, we can try %anywhere%:
library(data.table)
# %anywhere% returns TRUE if mydata is within any mIntervals, else FALSE
ans <- mydata[!mydata %anywhere% mIntervals]
This will include the endpoints, however, as incbounds = TRUE is the default setting. If you need to exclude the endpoints, you can use the following syntax:
mydata[!anywhere(mydata, mIntervals[, 1], mIntervals[, 2], incbounds = FALSE)]
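As far as I know, released versions of data.table expose this functionality under the names `inrange()` and `%inrange%` rather than `anywhere()`, so the calls above may need renaming. If you would rather stay in base R, the sketch below is one possible alternative that also addresses the overlapping-interval concern from the question. The helper name `in_any_interval` is hypothetical. It assumes every interval has start < end and that your R version supports the `left.open` argument of `findInterval()` (R 3.3.0 or later). The idea is that a value lies strictly inside at least one interval exactly when more intervals have started before it than have ended at or before it.

```r
# Hypothetical helper: TRUE where x lies strictly inside at least one (start, end)
# interval. Overlapping or unsorted intervals are fine; assumes start < end.
in_any_interval <- function(x, start, end) {
  # Number of interval starts strictly below each x
  n_started <- findInterval(x, sort(start), left.open = TRUE)
  # Number of interval ends at or below each x
  n_ended <- findInterval(x, sort(end))
  n_started > n_ended
}

mIntervals <- data.frame(start = c(2, 50, 97, 159), end = c(5, 75, 120, 160))
x <- c(1, 2, 3, 5, 60, 80, 159.5, 200)
inside <- in_any_interval(x, mIntervals$start, mIntervals$end)
x[!inside]  # values outside every interval; endpoints are kept
```

With the example data this would be used as `mydata[!in_any_interval(mydata, mIntervals$start, mIntervals$end)]`. Both `findInterval()` calls are vectorised, so the cost is dominated by sorting the interval bounds and a binary search per value.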