
How to define data.table keys for fastest aggregation using multiple keys

I am trying to better understand how to use keyed data.tables. After reading the documentation, I think I understand how to speed up subsetting when using one key. For example:

library(data.table)
DT = data.table(x=rep(c("ad","bd","cd"),each=3), y=c(1,3,6), v=1:9)

Option one:

DT[x == "ad"]

Option two:

setkey(DT,x)
DT["ad"]

In this case option one is much slower than option two, because data.table uses the key to search more efficiently (a binary search rather than a vector scan, which I do not fully understand but will trust is faster).
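To make the difference between the two options concrete, here is a minimal sketch using the same toy table. The names `scan_result`, `key_result` and `same_rows` are hypothetical, introduced only for illustration. It only checks that both forms select the same rows; on a table this small no speed difference would be measurable.

```r
library(data.table)

# Same toy table as in the question
DT <- data.table(x = rep(c("ad", "bd", "cd"), each = 3), y = c(1, 3, 6), v = 1:9)

# Option one: vector scan, every element of x is compared with "ad"
scan_result <- DT[x == "ad"]

# Option two: sort by x once, then look the value up in the sorted key
setkey(DT, x)
key_result <- DT["ad"]

# Both forms should return the same rows (v = 1, 2, 3)
same_rows <- identical(scan_result$v, key_result$v)
```

A vector scan touches every row on each query, while a lookup in a sorted key only needs on the order of log2(n) comparisons, which is why the keyed form is expected to win as the table grows.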

In the case of aggregating on a subset of the data using a by statement, what is the fastest way to define the key? Should I key the column that I am using to subset the data, or the column that defines the groups? For example:

setkey(DT,x)
DT[!"bd",sum(v),by=y]

or

setkey(DT,y)
DT[!"bd",sum(v),by=y]

Is there a way to utilize a key for both x and y?

EDIT

Does setting the key to both x and y perform two vector searches? i.e:

setkey(DT,x,y)

EDIT2

Sorry, what I meant to ask was: will the call DT[!"bd",sum(v),by=y] perform two binary searches when DT is keyed by both x and y?

dayne asked Nov 14 '13 19:11



1 Answer

I believe it is not possible to perform two binary searches in a single call when the data.table DT is keyed by both x and y. Instead, I would key first on x and then on y, as follows:

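library(data.table)
DT = data.table(x=rep(c("ad","bd","cd"),each=3), y=as.character(c(1,3,4)), v=1:9)
setkey(DT,x)            # key on x so the not-join below can use the key
tmp = DT[!"bd"]         # drop the rows where x is "bd"
setkey(tmp,y)           # re-key the remaining rows on y
tmp[!"1",sum(v),by=y]   # drop y == "1", then sum v within each y group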
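For comparison, here is a hedged sketch of the call from the question run against a table keyed on both columns. It uses the question's numeric y rather than the character y above, and the names `one_row` and `agg` are hypothetical. As far as I understand it, the not-join in i is resolved against the leading key column x, and the by=y grouping is a separate step, so this shows what the call returns rather than how many searches it performs internally.

```r
library(data.table)

# Toy table from the question (y is numeric here)
DT <- data.table(x = rep(c("ad", "bd", "cd"), each = 3), y = c(1, 3, 6), v = 1:9)

# Compound key: rows are sorted by x, then by y within each x
setkey(DT, x, y)

# A lookup that supplies one value per key column
one_row <- DT[.("ad", 3)]

# Not-join on the leading key column x, then group the remaining rows by y
agg <- DT[!"bd", sum(v), by = y]
# y = 1, 3, 6 give V1 = 8, 10, 12
```

With a compound key, a lookup can use y only when x is supplied as well, so keying on (x, y) helps the subset on x here but does not by itself turn the grouping on y into a second key lookup.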
MasterJedi answered Nov 13 '22 11:11