determining row indices of data.table group members

Tags:

data.table

The data.table package in R provides the option:

which: ‘TRUE’ returns the integer row numbers of ‘x’ that ‘i’ matches to.

However, I see no way of obtaining, within j, the integer row numbers of 'x' within the groups established using by.

For example, given...

DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6))

...I would like to know the indices into DT for each value of y.

The value to me is that I am using a data.table in parallel with Another Data Structure (ADS) to which I intend to perform groupwise computations based on the efficiently computed groupings of the data.table.

For example, assuming ADS is a vector with a value for each row in DT:

ADS<-sample(100,nrow(DT))

I can, as a workaround, compute the groupwise mean of ADS determined by DT$y the group if I first add a new sequence column to the data.table.

DT[,seqNum:=seq_len(nrow(DT))]
DT[,mean(ADS[seqNum]),by=y]

Which gives the result I want at the cost of adding a new column.

I realize that in this example I can get the same answer using tapply:

tapply(ADS,DT$y,mean)

However, I will not then get the performance benefit of data.tables efficient grouping (especially when the 'by' columns are indexed).

Perhaps there is some syntax I am overlooking???

Perhaps this is an easy feature to add to data.table and I should request it (wink, wink)???

Proposed syntax: optionally set '.which' to the group indices, allowing to write:

DT[,mean(ADS[.which]),by=y,which=TRUE]

253

asked Sep 13 '12 17:09

malcook

2 Answers

Available since data.table 1.8.3 you can use .I in the j of a data.table to get the row indices by groups...

DT[ , list( yidx = list(.I) ) , by = y ]
#   y  yidx
#1: 1 1,4,7
#2: 3 2,5,8
#3: 6 3,6,9

answered Sep 19 '22 12:09

Simon O'Hanlon

A keyed data.table will be sorted so that groups are stored in contiguous blocks. In that case, you could use .N to extract the group-wise indexing information:

DT <- data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6))
setkey(DT, y)

ii <- DT[,.N, by=y]
ii[, start := cumsum(N) - N[1] + 1][,end := cumsum(N)][, N := NULL]
#    y start end
# 1: 1     1   3
# 2: 3     4   6
# 3: 6     7   9

(Personally, I'd probably just add an indexing column like your suggested seqNum. Seems simpler, I don't think it will affect performance too much unless you are really pushing the limits.)

answered Sep 19 '22 12:09

Josh O'Brien

Related questions
                            
                                How to cope with a singular fit in a linear mixed model (lme4)?
                            
                                I am trying to install openssl package in R using Ubuntu 18.04 without success
                            
                                Can you use dplyr across() to iterate across pairs of columns?
                            
                                Calculate within and between variances and confidence intervals in R
                            
                                GLM with autoregressive term to correct for serial correlation
                            
                                Adding trend lines/boxplots (by group) in ggplot2
                            
                                Levels in R Dataframe
                            
                                S4 missing or NULL arguments to methods?
                            
                                multiple histograms on top of eachother without bins
                            
                                Write a function to remove object if it exists
                            
                                Chi Square Analysis using for loop in R
                            
                                How to factorize specific columns in a data.frame in R using apply
                            
                                "Cannot open the connection" - HPC in R with snow
                            
                                Efficiently locate group-wise constant columns in a data.frame
                            
                                R Regular Expression Lookbehind
                            
                                How do I retrieve a simple numeric value from a named numeric vector in R?
                            
                                R - Speeding up approximate date match. idata.frame?
                            
                                how to write micrometer squared per cubic meter in plot label in R
                            
                                geom_vline with Character xintercept
                            
                                frequency table with several variables in R

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With