
Subset data by a set of intervals in R

Tags: r, subset

I want to exclude values from a vector when they fall inside any of a set of intervals.

Example data:

mydata <-  sort(runif(100,0,300))
mIntervals <- data.frame(start = c(2,50,97,159) , end = c(5,75, 120, 160))

Solution 1: plain subset() with one condition per interval. Not suitable, because mIntervals may have many rows.

Solution 2: using nested for loops:

# start with every value marked as valid
valid <- rep(TRUE, length(mydata))
for (i in seq_along(mydata)) {
  # loop over the rows of mIntervals (length() would give the number of columns)
  for (j in seq_len(nrow(mIntervals))) {
    if (mydata[i] > mIntervals$start[j] && mydata[i] < mIntervals$end[j]) {
      valid[i] <- FALSE
    }
  }
}
mydata[valid]

This solution takes too long in R.
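
For comparison, the same strict (open-interval) test can be written without explicit loops. This is only a sketch in base R: the seed and the names inside, in_any and cleaned are made up for illustration, and it builds a length(mydata) x nrow(mIntervals) logical matrix, so it trades memory for speed and may not scale to very large inputs.

```r
set.seed(1)  # arbitrary seed, only to make the example reproducible
mydata <- sort(runif(100, 0, 300))
mIntervals <- data.frame(start = c(2, 50, 97, 159), end = c(5, 75, 120, 160))

# inside[i, j] is TRUE when mydata[i] lies strictly inside interval j
inside <- outer(mydata, mIntervals$start, ">") & outer(mydata, mIntervals$end, "<")

# in_any (hypothetical name): TRUE if the value falls inside at least one interval
in_any <- rowSums(inside) > 0

# keep only the values that are outside every interval
cleaned <- mydata[!in_any]
```

Because every value is compared against every interval, overlapping or unsorted intervals need no special handling here.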

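Solution 3: the findInterval() function

   library(FSA)  # provides is.even()
   # breakpoints: all starts and ends in increasing order (assumes non-overlapping intervals)
   valid <- findInterval(mydata, sort(c(mIntervals$start, mIntervals$end)))
   # an even bin index means the value lies outside every interval
   mydata[is.even(valid)]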

Solution 4: somehow use the 'intervals' package, but I found no suitable function there either (maybe interval_overlap()).

A quite similar (but not identical) issue has already been discussed here, but the solutions there are for a vector of integers, not for a continuous variable.

I have no more ideas. Solution 3 seems to be the best, but I don't like it because it is not robust: you would have to check for overlapping intervals, etc.
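
One way the overlap concern might be handled is to merge overlapping intervals first and only then apply findInterval(). The sketch below is an assumption-laden illustration, not a tested library routine: merge_intervals and outside_intervals are hypothetical helper names, the intervals are assumed numeric with at least one row, and findInterval() treats each merged interval as half-open, [start, end).

```r
# Hypothetical helper: merge overlapping (or touching) intervals into disjoint ones
merge_intervals <- function(iv) {
  iv <- iv[order(iv$start), ]
  # largest end point among all earlier intervals (-Inf for the first one)
  prev_end <- c(-Inf, head(cummax(iv$end), -1))
  # a new group starts whenever an interval begins after all earlier ones ended
  grp <- cumsum(iv$start > prev_end)
  data.frame(start = as.numeric(tapply(iv$start, grp, min)),
             end   = as.numeric(tapply(iv$end, grp, max)))
}

# Hypothetical helper: TRUE for values that lie outside every interval
outside_intervals <- function(x, iv) {
  m <- merge_intervals(iv)
  # interleave start/end into one increasing vector of break points
  breaks <- as.vector(rbind(m$start, m$end))
  # an odd bin index means "between a start and its end", i.e. inside
  findInterval(x, breaks) %% 2 == 0
}

# Made-up example with two overlapping intervals (50-75 and 60-120)
iv <- data.frame(start = c(2, 50, 60, 159), end = c(5, 75, 120, 160))
x  <- c(1, 3, 55, 70, 80, 130, 159.5, 200)
x[outside_intervals(x, iv)]
```

In this example the value 70 lies inside both overlapping intervals; without the merge step, the plain sorted-breakpoints approach would place it in an even bin and keep it by mistake.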

Is there a better solution to this seemingly simple problem? Thanks.

Real data: I have light intensity measured at certain times (datetime, intensity). I also have datetime intervals during which the measuring device was under maintenance (start, end). I want to clean the data, i.e. exclude values measured during maintenance periods (efficiently!).
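
To make the real-data case concrete, here is a small sketch with invented timestamps. All names and values (t0, readings, maintenance, in_maint) are assumptions for illustration, the maintenance windows are assumed sorted and non-overlapping, and each window is treated as half-open, [start, end), because that is how findInterval() bins values.

```r
# Made-up data: one reading per minute for an hour (UTC)
t0 <- as.POSIXct("2016-01-01 00:00:00", tz = "UTC")
readings <- data.frame(datetime  = t0 + 60 * (0:59),
                       intensity = seq(100, 159))

# Made-up maintenance windows: minutes 10-20 and 40-45
maintenance <- data.frame(start = t0 + 60 * c(10, 40),
                          end   = t0 + 60 * c(20, 45))

# work on the numeric (seconds since epoch) scale
s <- as.numeric(maintenance$start)
e <- as.numeric(maintenance$end)

# interleave start/end into one increasing vector of break points
breaks <- as.vector(rbind(s, e))

# an odd bin index means the reading was taken during maintenance
in_maint <- findInterval(as.numeric(readings$datetime), breaks) %% 2 == 1

cleaned <- readings[!in_maint, ]
```

Converting the datetimes to numbers explicitly keeps the comparison unambiguous; the rows of cleaned still carry the original POSIXct values.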

Dead Vil asked Nov 30 '22 16:11


1 Answer

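Using the development version (1.9.7) of data.table, we can try %anywhere% (as far as I know, released versions from 1.9.8 on provide the same functionality as %inrange% / inrange()):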

library(data.table)
# %anywhere% returns TRUE if mydata is within any mIntervals, else FALSE
ans <- mydata[!mydata %anywhere% mIntervals] 

Note that the interval endpoints count as part of the interval here, because incbounds = TRUE is the default, so values equal to an endpoint are removed as well. To treat the endpoints as outside the intervals, use the following syntax:

mydata[!anywhere(mydata, mIntervals[, 1], mIntervals[, 2], incbounds = FALSE)]

mtoto answered Dec 05 '22 11:12