Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Subsetting data.table by not head(key(DT),m), using binary search not vector scan

Tags:

r

data.table

If I specify n columns as a key of a data.table, I'm aware that I can join to fewer columns than are defined in that key as long as I join to the head of key(DT). For example, for n=2 :

X = data.table(A=rep(1:5, each=2), B=rep(1:2, each=5), key=c('A','B'))
X
    A B
 1: 1 1
 2: 1 1
 3: 2 1
 4: 2 1
 5: 3 1
 6: 3 2
 7: 4 2
 8: 4 2
 9: 5 2
10: 5 2

X[J(3)]
   A B
1: 3 1
2: 3 2

There I only joined to the first column of the 2-column key of DT. I know I can join to both columns of the key like this :

X[J(3,1)]
   A B
1: 3 1

But how do I subset using only the second column colum of the key (e.g. B==2), but still using binary search not vector scan? I'm aware that's a duplicate of :

Subsetting data.table by 2nd column only of a 2 column key, using binary search not vector scan

so I'd like to generalise this question to n. My data set has about a million rows and solution provided in dup question linked above doesn't seem to be optimal.

like image 534
CHP Avatar asked Apr 02 '13 16:04

CHP


2 Answers

Here is a simple function that will extract the correct unique values and return a data table to use as a key.

 X <- data.table(A=rep(1:5, each=4), B=rep(1:4, each=5), 
                  C = letters[1:20], key=c('A','B','C'))
 make.key <- function(ddd, what){
  # the names of the key columns
  zzz <- key(ddd)
  # the key columns you wish to keep all unique values
  whichUnique <- setdiff(zzz, names(what))
  ## unique data.table (when keyed); .. means "look up one level"
  ud <-  lapply([, ..whichUnique], unique)
  ## append the `what` columns and  a Cross Join of the new
  ## key columns
  do.call(CJ, c(ud,what)[zzz])
}   

X[make.key(X, what = list(C = c('a','b'))),nomatch=0]
## A B C
## 1: 1 1 a
## 2: 1 1 b

I'm not sure this will be any quicker than a couple of vector scans on a large data.table though.

like image 132
mnel Avatar answered Nov 17 '22 00:11

mnel


Adding secondary keys is on the feature request list :

FR#1007 Build in secondary keys

In the meantime we are stuck with either vector scan, or the approach used in the answer to the n=2 case linked in the question (which @mnel generalises nicely in his answer).

like image 38
Matt Dowle Avatar answered Nov 16 '22 23:11

Matt Dowle