The data.table package in R provides the option:
which: ‘TRUE’ returns the integer row numbers of ‘x’ that ‘i’ matches to.
However, I see no way of obtaining, within j
, the integer row numbers of 'x' within the groups established using by
.
For example, given...
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6))
...I would like to know the indices into DT for each value of y.
The value to me is that I am using a data.table in parallel with Another Data Structure (ADS) to which I intend to perform groupwise computations based on the efficiently computed groupings of the data.table.
For example, assuming ADS is a vector with a value for each row in DT:
ADS<-sample(100,nrow(DT))
I can, as a workaround, compute the groupwise mean of ADS determined by DT$y the group if I first add a new sequence column to the data.table.
DT[,seqNum:=seq_len(nrow(DT))]
DT[,mean(ADS[seqNum]),by=y]
Which gives the result I want at the cost of adding a new column.
I realize that in this example I can get the same answer using tapply:
tapply(ADS,DT$y,mean)
However, I will not then get the performance benefit of data.tables efficient grouping (especially when the 'by' columns are indexed).
Perhaps there is some syntax I am overlooking???
Perhaps this is an easy feature to add to data.table and I should request it (wink, wink)???
Proposed syntax: optionally set '.which' to the group indices, allowing to write:
DT[,mean(ADS[.which]),by=y,which=TRUE]
For example, if we have a data frame called df that contains a value say Data then we can find the row and column index of Data by using the command as which(df=="Data",arr. ind=TRUE).
Available since data.table
1.8.3 you can use .I
in the j
of a data.table
to get the row indices by groups...
DT[ , list( yidx = list(.I) ) , by = y ]
# y yidx
#1: 1 1,4,7
#2: 3 2,5,8
#3: 6 3,6,9
A keyed data.table will be sorted so that groups are stored in contiguous blocks. In that case, you could use .N
to extract the group-wise indexing information:
DT <- data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6))
setkey(DT, y)
ii <- DT[,.N, by=y]
ii[, start := cumsum(N) - N[1] + 1][,end := cumsum(N)][, N := NULL]
# y start end
# 1: 1 1 3
# 2: 3 4 6
# 3: 6 7 9
(Personally, I'd probably just add an indexing column like your suggested seqNum
. Seems simpler, I don't think it will affect performance too much unless you are really pushing the limits.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With