Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

determining row indices of data.table group members

Tags:

r

data.table

The data.table package in R provides the option:

which: ‘TRUE’ returns the integer row numbers of ‘x’ that ‘i’ matches to.

However, I see no way of obtaining, within j, the integer row numbers of 'x' within the groups established using by.

For example, given...

DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6))

...I would like to know the indices into DT for each value of y.

The value to me is that I am using a data.table in parallel with Another Data Structure (ADS) to which I intend to perform groupwise computations based on the efficiently computed groupings of the data.table.

For example, assuming ADS is a vector with a value for each row in DT:

ADS<-sample(100,nrow(DT))

I can, as a workaround, compute the groupwise mean of ADS determined by DT$y the group if I first add a new sequence column to the data.table.

DT[,seqNum:=seq_len(nrow(DT))]
DT[,mean(ADS[seqNum]),by=y]

Which gives the result I want at the cost of adding a new column.

I realize that in this example I can get the same answer using tapply:

tapply(ADS,DT$y,mean)

However, I will not then get the performance benefit of data.tables efficient grouping (especially when the 'by' columns are indexed).

Perhaps there is some syntax I am overlooking???

Perhaps this is an easy feature to add to data.table and I should request it (wink, wink)???

Proposed syntax: optionally set '.which' to the group indices, allowing to write:

DT[,mean(ADS[.which]),by=y,which=TRUE]
like image 253
malcook Avatar asked Sep 13 '12 17:09

malcook


People also ask

How do I find row and column index?

For example, if we have a data frame called df that contains a value say Data then we can find the row and column index of Data by using the command as which(df=="Data",arr. ind=TRUE).


2 Answers

Available since data.table 1.8.3 you can use .I in the j of a data.table to get the row indices by groups...

DT[ , list( yidx = list(.I) ) , by = y ]
#   y  yidx
#1: 1 1,4,7
#2: 3 2,5,8
#3: 6 3,6,9
like image 98
Simon O'Hanlon Avatar answered Sep 19 '22 12:09

Simon O'Hanlon


A keyed data.table will be sorted so that groups are stored in contiguous blocks. In that case, you could use .N to extract the group-wise indexing information:

DT <- data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6))
setkey(DT, y)

ii <- DT[,.N, by=y]
ii[, start := cumsum(N) - N[1] + 1][,end := cumsum(N)][, N := NULL]
#    y start end
# 1: 1     1   3
# 2: 3     4   6
# 3: 6     7   9

(Personally, I'd probably just add an indexing column like your suggested seqNum. Seems simpler, I don't think it will affect performance too much unless you are really pushing the limits.)

like image 29
Josh O'Brien Avatar answered Sep 19 '22 12:09

Josh O'Brien