How to define data.table keys for fastest aggregation using multiple keys

I am trying to better understand how to use keyed data.tables. After reading the documentation, I think I understand how to speed up subsetting when using one key. For example:

library(data.table)
DT = data.table(x=rep(c("ad","bd","cd"),each=3), y=c(1,3,6), v=1:9)

Option one:

DT[x == "ad"]

Option two:

setkey(DT,x)
DT["ad"]

In this case option one is much slower than option two, because data.table uses the key to search more efficiently (a binary search rather than a vector scan; I do not fully understand the difference, but I will trust that it is faster).
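To make the difference concrete, here is a minimal sketch that times both options on a larger table. The table `big`, its size `n`, and the seed are assumptions made for illustration and are not part of the original example. Note that newer data.table releases may build an index automatically for `x == "ad"`, so the measured gap can be smaller than the one older versions showed.

```r
library(data.table)

# Hypothetical larger table so that a timing difference has a chance to show
set.seed(1)
n   <- 1e6
big <- data.table(x = sample(c("ad", "bd", "cd"), n, replace = TRUE),
                  v = seq_len(n))

# Option one: vector scan, every element of x is compared against "ad"
system.time(scan_res <- big[x == "ad"])

# Option two: sort by x once, then locate the block of "ad" rows by binary search
setkey(big, x)
system.time(key_res <- big["ad"])

# Both options select the same rows
stopifnot(nrow(scan_res) == nrow(key_res),
          setequal(scan_res$v, key_res$v))
```

The one-off cost of `setkey()` is not included in the second timing, so the comparison only favours the key if the table is subset repeatedly.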

In the case of aggregating on a subset of the data using a by statement, what is the fastest way to define the key? Should I key the column that I am using to subset the data, or the column that defines the groups? For example:

setkey(DT,x)
DT[!"bd",sum(v),by=y]

setkey(DT,y)
DT[!"bd",sum(v),by=y]

Is there a way to utilize a key for both x and y?

EDIT

Does setting the key to both x and y perform two vector searches? i.e.:

setkey(DT,x,y)

EDIT2

Sorry, what I meant to ask was: will the call DT[!"bd",sum(v),by=y] perform two binary searches when DT is keyed by both x and y?
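For reference, a minimal sketch of what the two-column key looks like in use, based on the example table from this question. The names `agg` and `one_row` are introduced here for illustration only. It shows how values in `i` are matched against the key columns; it does not settle what happens internally for `by`.

```r
library(data.table)

# Same data as the question, with the recycling of y written out explicitly
DT <- data.table(x = rep(c("ad", "bd", "cd"), each = 3),
                 y = rep(c(1, 3, 6), times = 3),
                 v = 1:9)

# Compound key: rows are sorted by x, then by y within each x
setkey(DT, x, y)

# A single value in i is matched against the first key column (x) only
agg <- DT[!"bd", sum(v), by = y]

# Supplying one value per key column looks up on x and y together
one_row <- DT[.("ad", 3)]
```

Here `agg` holds one sum per value of y over the rows where x is not "bd", and `one_row` is the single row with x equal to "ad" and y equal to 3.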

asked Nov 14 '13 19:11

dayne

1 Answer

I believe it is not possible to perform two binary searches in a single call when DT is keyed by both x and y. Instead, I would key the table first on x and then key the resulting subset on y, as follows:

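library(data.table)
DT = data.table(x=rep(c("ad","bd","cd"),each=3), y=as.character(c(1,3,4)), v=1:9)
setkey(DT,x)            # key on x so the not-join below can use binary search
tmp = DT[!"bd"]         # drop the rows where x == "bd"
setkey(tmp,y)           # re-key the subset on y
tmp[!"1",sum(v),by=y]   # drop the rows where y == "1", then sum v within each remaining y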
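As a sanity check on the two-step approach, the sketch below compares it with a single vector-scan expression that applies both filters at once. The names `two_step` and `one_step` are hypothetical and only used here. Keep in mind that `setkey()` physically sorts the table, so whether re-keying pays off will depend on the table size and on how often the subset is reused.

```r
library(data.table)

# Same data as above, with the recycling of y written out explicitly
DT <- data.table(x = rep(c("ad", "bd", "cd"), each = 3),
                 y = rep(as.character(c(1, 3, 4)), times = 3),
                 v = 1:9)

# Two-step approach: key on x, subset, re-key on y, subset and aggregate
setkey(DT, x)
tmp <- DT[!"bd"]
setkey(tmp, y)
two_step <- tmp[!"1", sum(v), by = y]

# Single expression using vector scans on both columns
one_step <- DT[x != "bd" & y != "1", sum(v), by = y]

# Both routes give the same group sums
stopifnot(identical(two_step$y, one_step$y),
          identical(two_step$V1, one_step$V1))
```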

answered Nov 13 '22 11:11

MasterJedi

Tags: performance, r, data.table, compound-key