Count the number of values in a window of a data.table per group

Tags: r, data.table

I'm trying to add a new column to a data.table, where the value in each row depends on how that row relates to the other values in the same column. More precisely, for a value y in column Y, I would like to know how many other values in the same column (and in the same group, defined by X) fall between y - 30 and y.

That is, given this:

library(data.table)

DT <- data.table(
  X = c(1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1),
  Y = c(100, 101, 133, 134, 150, 156, 190, 200, 201, 230, 233, 234),
  Z = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12))

I would like to get a new column, with values:

N <- c(0, 0, 0, 0, 1, 1,  0, 0, 1, 0, 1, 2)
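
To make the intended definition concrete, here is a minimal brute-force sketch in base R. It assumes the window is strict on both sides, that is, it counts values v in the same X group with y - 30 < v < y. The helper name count_window and the result name N_ref are made up for illustration and are not part of data.table.

```r
# Same X and Y values as in the example table
X <- c(1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1)
Y <- c(100, 101, 133, 134, 150, 156, 190, 200, 201, 230, 233, 234)

# Hypothetical helper: for each value v in y, count the other values
# that lie strictly between v - width and v
count_window <- function(y, width = 30) {
  vapply(y, function(v) sum(y > v - width & y < v), integer(1))
}

# ave() applies the helper within each X group and keeps the original row order
N_ref <- ave(Y, X, FUN = count_window)
print(N_ref)
```

This is quadratic in the group size, so it is only meant as a reference to check faster approaches against. The same helper should also work per group inside a data.table, for example DT[, N := count_window(Y), by = X].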

I've tried the following, but it doesn't give me usable results:

DT[,list(Y,num=cumsum(Y[-.I]>DT[.I,Y]-30),Z),by=.(X)]

Any ideas how to do this?

Nikola Knezevic asked Mar 14 '16 16:03



2 Answers

This can probably be achieved with a rolling join (?), but here is a foverlaps alternative for now:

DT[, `:=`(indx = .I, Y2 = Y - 30L, N = 0L)] # Add a row index, the window start (Y - 30) and a zero-initialised N
setkey(DT, X, Y2, Y) # Sort by X and the [Y2, Y] intervals (required by foverlaps)
res <- foverlaps(DT, DT)[Y2 > i.Y2, .N, keyby = indx] # Self-overlap join; for each row, count the overlapping intervals that start earlier
setorder(DT, indx) # Go back to the original order
DT[res$indx, N := res$N][, c("indx", "Y2") := NULL] # Write the counts back and remove the helper columns
DT
#     X   Y  Z N
#  1: 1 100  1 0
#  2: 2 101  2 0
#  3: 2 133  3 0
#  4: 1 134  4 0
#  5: 1 150  5 1
#  6: 2 156  6 1
#  7: 1 190  7 0
#  8: 2 200  8 0
#  9: 2 201  9 1
# 10: 1 230 10 0
# 11: 1 233 11 1
# 12: 1 234 12 2

Alternatively, use the which=TRUE option of foverlaps, which returns only the matching row indices (xid, yid) and so keeps the overlap join result smaller:

# as above
DT[, `:=`(indx = .I, Y2 = Y - 30L, N = 0L)]
setkey(DT, X, Y2, Y)

# using which=TRUE:
res <- foverlaps(DT, DT, which=TRUE)[xid > yid, .N, by=xid]
DT[res$xid, N := res$N]
setorder(DT, indx)
DT[, c("Y2","indx") := NULL]
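
A different technique that may also fit here is a non-equi join, available in data.table 1.9.8 and later. The sketch below is not the foverlaps approach above; it assumes a strict window (Y - 30 < other Y < Y), and the names win, lo and cnt are illustrative only.

```r
library(data.table)

DT <- data.table(
  X = c(1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1),
  Y = c(100, 101, 133, 134, 150, 156, 190, 200, 201, 230, 233, 234),
  Z = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12))

# Lookup table: one row per original row, with the lower window bound `lo`
win <- DT[, .(X, lo = Y - 30, Y)]

# For each row of `win`, count the rows of DT in the same X group whose Y is
# above `lo` and below that row's own Y. With by = .EACHI, .N is the number
# of matched rows per row of `win` (0 when nothing matches).
cnt <- DT[win, on = .(X, Y > lo, Y < Y), .N, by = .EACHI]

# `cnt` has one row per row of `win`, in the original row order
DT[, N := cnt$N]
print(DT)
```

Because the result follows the order of win, no index column or re-sorting is needed in this variant.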
David Arenburg answered Sep 30 '22 11:09



Here's another way:

DT[order(Y), N := 0:(.N-1) - findInterval(Y - 30, Y), by = X]

all.equal(DT$N,N) # TRUE
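
To see why the one-liner works, it can help to walk through a single group by hand. The sketch below uses the X == 1 values from the question, already sorted; the variable names before, too_far and n are illustrative only.

```r
# Y values of the X == 1 group, in increasing order
y <- c(100, 134, 150, 190, 230, 233, 234)

# Number of values that precede each element in sorted order (0, 1, 2, ...)
before <- seq_along(y) - 1

# Number of values that are <= y - 30, i.e. too far back to fall in the window
too_far <- findInterval(y - 30, y)

# What is left are the values inside the window (y - 30, y)
n <- before - too_far
print(n)  # 0 0 1 0 0 1 2
```

In the data.table call, order(Y) supplies the sorted order, 0:(.N-1) plays the role of before, and by = X repeats the calculation for each group.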
Frank answered Sep 30 '22 11:09
