
Grouping a data.table by running intervals

I am using R with the data.table package and would like to group a data.table by running (time) intervals, i.e. overlapping bins. For each of these running intervals I would like to count the occurrences of equal pairs of data. Furthermore, these "equal pairs of data" need not be exactly equal; they only have to lie within some interval range of each other.

The simple version of the question is as follows:

#Time   X   Y Counts
# ... ... ...      1
#I would like to do:
DT[, sum(Counts), by = list(Time, X, Y)]
#with Time, X and Y being in overlapping intervals.

findInterval() would give me bins with "hard borders", not overlapping ones.
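
To illustrate the "hard borders" point, here is a minimal sketch using base R's findInterval(). The sample values and break points are made up purely for illustration and are not part of the data below.

```r
# Hypothetical time values and bin boundaries, chosen only for illustration
times  <- c(1, 4, 5, 6, 12)
breaks <- c(0, 5, 10)

# findInterval() assigns every value to exactly one bin: [0,5), [5,10), [10,Inf)
bins <- findInterval(times, breaks)
print(bins)  # 1 1 2 2 3
```

The values 4 and 5 are only one unit apart, yet they land in different bins. A running window of +/- Timeinterval around each row is meant to avoid exactly that.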

The problem in more detail: Let's say I have a data.table like that:

library(data.table)

Time    <- c(1,1,2,4,5,5,6,7,8,8,8,8,12,13)
#more equal time values are allowed.
X       <- c(6,6,7,10,5,7,6,3,9,10,6,3,3,6)
Y       <- c(2,6,10,3,4,6,6,9,4,9,6,6,9,9)
DT      <- data.table(Time, X, Y)

    Time  X  Y
 1:    1  6  2
 2:    1  6  6
 3:    2  7 10
 4:    4 10  3
 5:    5  5  4
 6:    5  7  6
 7:    6  6  6
 8:    7  3  9
 9:    8  9  4
10:    8 10  9
11:    8  6  6
12:    8  3  6
13:   12  3  9
14:   13  6  9

And some predefined interval sizes:

Timeinterval      <- 5
#for a time value of 10 this means to look from 10-5 to 10+5
RangeX.percentage <- 0.5 
RangeY.percentage <- 0.5

The result should give me an additional column, let's call it "count", with the number of occurrences of equal pairs of X and Y, considering the ranges for X and Y.

I thought about some kind of grouping by time intervals like

c(1, 1, 2, 4, 5, 5, 6) #for the first item: (1-5):(1+5)
c(1, 1, 2, 4, 5, 5, 6, 7) # for the third item: (2-5):(2+5)
c(1, 1, 2, 4, 5, 5, 6, 7, 8, 8, 8, 8) # for the fourth item: (4-5):(4+5)
#...
c(8, 8, 8, 8, 12, 13) # for the last item (13-5):(13+5)
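
The windows sketched above can be generated directly in base R. This is only an illustration of the idea, not a scalable solution: it builds one vector per row, so the work grows quadratically with the number of rows. The name `windows` is introduced here for illustration.

```r
Time         <- c(1, 1, 2, 4, 5, 5, 6, 7, 8, 8, 8, 8, 12, 13)
Timeinterval <- 5

# For every row, collect all Time values within +/- Timeinterval (bounds included)
windows <- lapply(Time, function(t0) {
  Time[Time >= t0 - Timeinterval & Time <= t0 + Timeinterval]
})

windows[[1]]   # 1 1 2 4 5 5 6
windows[[14]]  # 8 8 8 8 12 13
```

Note that the bounds are inclusive here, matching the vectors listed above, whereas the loop further down uses a strict comparison (`<`) for the time difference.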

and the following conditions for the data (but maybe there is a simpler version for that part too):

EDIT: To clarify what the result should look like:

Ranges <- DT[ , list(
             X* (1 + RangeX.percentage), X* (1 - RangeX.percentage),
             Y* (1 + RangeY.percentage), Y* (1 - RangeY.percentage))]
DT2 <- cbind(DT, Ranges, count = rep(1, nrow(DT)))
setnames(DT2, c("Time","X","Y","XR1","XR2","YR1","YR2","count"))
for (i in seq_len(nrow(DT2))){
  cur <- DT2[i]  # current row with its upper (XR1, YR1) and lower (XR2, YR2) bounds
  #main part of the question: how to get this done within data.table:
  DT2.subset <- DT2[abs(Time - cur$Time) < Timeinterval]
  #subsequent comparison of X and Y:
  n <- sum(DT2.subset$X < cur$XR1 & 
           DT2.subset$X > cur$XR2 &
           DT2.subset$Y < cur$YR1 & 
           DT2.subset$Y > cur$YR2)
  set(DT2, i = i, j = "count", value = n)  # write the count into DT2 (not DT)
}
 DT2
    Time  X  Y  XR1 XR2  YR1 YR2 count
 1:    1  6  2  9.0 3.0  3.0 1.0     1
 2:    1  6  6  9.0 3.0  9.0 3.0     3
 3:    2  7 10 10.5 3.5 15.0 5.0     4
 4:    4 10  3 15.0 5.0  4.5 1.5     3
 5:    5  5  4  7.5 2.5  6.0 2.0     1
 6:    5  7  6 10.5 3.5  9.0 3.0     6
 7:    6  6  6  9.0 3.0  9.0 3.0     4
 8:    7  3  9  4.5 1.5 13.5 4.5     2
 9:    8  9  4 13.5 4.5  6.0 2.0     3
10:    8 10  9 15.0 5.0 13.5 4.5     4
11:    8  6  6  9.0 3.0  9.0 3.0     4
12:    8  3  6  4.5 1.5  9.0 3.0     1
13:   12  3  9  4.5 1.5 13.5 4.5     2
14:   13  6  9  9.0 3.0 13.5 4.5     1

As my complete data.table contains more than a million rows, checking all of DT$Time for each row is far too slow.

Phann asked Mar 01 '16 09:03


1 Answer

You could try data.table::foverlaps. We will create Ranges pretty much as you did, adding Time ranges and a row index (for later aggregation). The main issue here is that you want strict < and > rather than <= and >=, while foverlaps treats interval bounds as inclusive, so we will have to shrink the Time intervals by 1 on each side (this works because the Time values here are whole numbers). Then we will add a Time interval to DT too, key it, and run foverlaps. The final stage is to count the observations per row.
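
As a quick sanity check of the +-1 adjustment, the following base R sketch compares the strict condition from the question with the closed interval passed to foverlaps. The names `t0`, `t`, `strict` and `closed` are made up for illustration, and the equivalence only holds for whole-number Time values.

```r
Timeinterval <- 5
t0 <- 10    # hypothetical centre of the window
t  <- 0:20  # candidate whole-number Time values

# Strict condition used in the question's loop
strict <- abs(t - t0) < Timeinterval
# Closed interval with both bounds moved inwards by 1
closed <- t >= t0 - Timeinterval + 1 & t <= t0 + Timeinterval - 1

identical(strict, closed)  # TRUE

# With a fractional Time the two conditions disagree
t_frac <- 5.5
abs(t_frac - t0) < Timeinterval      # TRUE  (4.5 < 5)
t_frac >= t0 - Timeinterval + 1      # FALSE (5.5 < 6)
```

If Time can take fractional values, the shift by 1 would drop valid matches, so the inner bounds would need a different treatment in that case.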

DT[, Time2 := Time] ## Add higher interval to DT
setkey(DT, Time, Time2) ## key (for foverlaps)

Ranges <- 
  DT[ , .(Time = Time - Timeinterval + 1, ## Add lower Time interval
          Time2 = Time + Timeinterval - 1, ## Add higher Time interval
          XR1 = X* (1 - RangeX.percentage), 
          XR2 = X* (1 + RangeX.percentage),
          YR1 = Y* (1 - RangeY.percentage), 
          YR2 = Y* (1 + RangeY.percentage),
          indx = .I)] ## Add row index

# Run foverlaps and count incidences by condition while updating DT by reference
DT[, 
   count := foverlaps(Ranges, DT)[X > XR1 & X < XR2 & Y > YR1 & Y < YR2,
                                   .N, 
                                   keyby = indx]$N]  
DT
#     Time  X  Y Time2  count
#  1:    1  6  2     1      1
#  2:    1  6  6     1      3
#  3:    2  7 10     2      4
#  4:    4 10  3     4      3
#  5:    5  5  4     5      1
#  6:    5  7  6     5      6
#  7:    6  6  6     6      4
#  8:    7  3  9     7      2
#  9:    8  9  4     8      3
# 10:    8 10  9     8      4
# 11:    8  6  6     8      4
# 12:    8  3  6     8      1
# 13:   12  3  9    12      2
# 14:   13  6  9    13      1
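
As a possible alternative, a non-equi join may express the strict inequalities directly, without the +-1 shift. This is only a sketch, assuming a data.table version that supports non-equi joins in `on=`; the names `Bounds`, `tlo`, `thi`, `xlo`, `xhi`, `ylo`, `yhi` and `counts` are introduced here for illustration, and it has not been benchmarked against foverlaps on large data.

```r
library(data.table)

Time <- c(1, 1, 2, 4, 5, 5, 6, 7, 8, 8, 8, 8, 12, 13)
X    <- c(6, 6, 7, 10, 5, 7, 6, 3, 9, 10, 6, 3, 3, 6)
Y    <- c(2, 6, 10, 3, 4, 6, 6, 9, 4, 9, 6, 6, 9, 9)
DT   <- data.table(Time, X, Y)

Timeinterval      <- 5
RangeX.percentage <- 0.5
RangeY.percentage <- 0.5

# One row of (exclusive) lower/upper bounds per row of DT
Bounds <- DT[, .(tlo = Time - Timeinterval,
                 thi = Time + Timeinterval,
                 xlo = X * (1 - RangeX.percentage),
                 xhi = X * (1 + RangeX.percentage),
                 ylo = Y * (1 - RangeY.percentage),
                 yhi = Y * (1 + RangeY.percentage))]

# For each row of Bounds, count the rows of DT lying strictly inside all bounds
counts <- DT[Bounds,
             on = .(Time > tlo, Time < thi,
                    X > xlo, X < xhi,
                    Y > ylo, Y < yhi),
             .N, by = .EACHI]$N

DT[, count := counts]
DT
```

Because `by = .EACHI` returns one result per row of `Bounds` in its original order, the counts line up with the rows of DT and should reproduce the count column shown above.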
David Arenburg answered Oct 24 '22 09:10