Subsetting data.table by 2nd column only of a 2 column key, using binary search not vector scan

Tags: r, data.table

I recently discovered binary search in data.table. If the table is keyed on multiple columns, is it possible to search on the 2nd key column only?

library(data.table)
DT = data.table(x=sample(letters,1e7,TRUE),y=sample(1:25,1e7,TRUE),rnorm(1e7))
setkey(DT,x,y)
#R> DT[J('x')]
#        x  y       V3
#     1: x  1  0.89109
#     2: x  1 -2.01457
#    ---              
#384922: x 25  0.09676
#384923: x 25  0.25168
#R> DT[J('x',3)]
#       x y       V3
#    1: x 3 -0.88165
#    2: x 3  1.51028
#   ---             
#15383: x 3 -1.62218
#15384: x 3 -0.63601

EDIT: thanks to @Arun

R> system.time(DT[J(unique(x), 25)])
   user  system elapsed 
  0.220   0.068   0.288 
R> system.time(DT[y==25])
   user  system elapsed 
  0.268   0.092   0.359

asked Mar 24 '13 11:03

statquant

2 Answers

Yes, you can pass all unique values of the first key column and subset with the specific value for the second key column.

DT[J(unique(x), 25), nomatch=0]

If you need to subset by more than one value in the second key column (e.g. the equivalent of DT[y %in% 25:24]), a more general solution is to use CJ:

DT[CJ(unique(x), 25:24), nomatch=0]

Note that CJ by default sorts its columns and sets the key on all of them, which means the result will be sorted as well. If that's not desirable, use sorted=FALSE:

DT[CJ(unique(x), 25:24, sorted=FALSE), nomatch=0]
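
As a quick sanity check, the CJ join above can be compared with the plain vector scan on a small table. This is only a sketch: the table size, the seed and the column name v are assumptions made for the illustration, not part of the question.

```r
library(data.table)

set.seed(1)
# Small stand-in for the question's table; the column name v is an assumption.
DT = data.table(x = sample(letters, 1e4, TRUE),
                y = sample(1:25, 1e4, TRUE),
                v = rnorm(1e4))
setkey(DT, x, y)

# Binary search: cross every value of the first key column with the wanted y values.
by_join = DT[CJ(unique(x), 24:25), nomatch = 0]

# Vector scan over y, for comparison.
by_scan = DT[y %in% 24:25]

# Both approaches should select the same rows.
nrow(by_join) == nrow(by_scan)
identical(by_join$v, by_scan$v)
```

Because CJ sorts its output by default, the joined rows come back in key order, which is also the order the vector scan returns them in on a keyed table.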

There's also a feature request to add secondary keys to data.table in the future. I believe the plan is to add a new function, set2key.

FR#1007 Build in secondary keys
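
For readers on a more recent data.table, secondary indices are available through setindex together with the on= argument, which gives a lookup on y alone without changing the primary key. The following is a minimal sketch, assuming a release that provides both; the small table and the column name v are made up for the illustration.

```r
library(data.table)

set.seed(1)
# Small stand-in for the question's table; the column name v is an assumption.
DT = data.table(x = sample(letters, 1e4, TRUE),
                y = sample(1:25, 1e4, TRUE),
                v = rnorm(1e4))
setkey(DT, x, y)

# Build a secondary index on y; the primary key (x, y) and the row order are unchanged.
setindex(DT, y)

# Join on y alone; on= names the column to match against.
by_index = DT[.(25L), on = "y", nomatch = 0]

# Vector-scan equivalent, for comparison.
by_scan = DT[y == 25L]

nrow(by_index) == nrow(by_scan)
```

Unlike the J(unique(x), 25) approach, this does not need to enumerate the values of the first key column.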

There is also merge, which has a method for data.table. It builds the secondary key inside it for you, so should be faster than base merge. See ?merge.data.table.

answered Oct 27 '22 08:10

Arun

Based on this email thread I wrote the following functions:

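# Build a lookup table keyed on the given columns.
# Column i stores the row number each entry had in dt.
create_index = function(dt, ..., verbose = getOption("datatable.verbose")) {
  cols = data.table:::getdots()   # internal helper: names passed via ... as character
  res = dt[, cols, with=FALSE]
  res[, i:=seq_len(nrow(dt))]
  setkeyv(res, cols, verbose = verbose)
}

# Binary-search the index and return the matching row numbers of the original table.
# Extracting i after the join works whether or not j returns a plain vector.
JI = function(index, ...) {
  index[J(...)]$i
}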

Here are the results on my system with a larger DT (1e8 rows):

> system.time(DT[J("c")])
   user  system elapsed 
  0.168   0.136   0.306 

> system.time(DT[J(unique(x), 25)])
   user  system elapsed 
  2.472   1.508   3.980 
> system.time(DT[y==25])
   user  system elapsed 
  4.532   2.149   6.674 

> system.time(IDX_y <- create_index(DT, y))
   user  system elapsed 
  3.076   2.428   5.503 
> system.time(DT[JI(IDX_y, 25)])
   user  system elapsed 
  0.512   0.320   0.831

Building the index takes about as long as a single vector scan, so it is worth it only if you use the index multiple times.
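
To check that such a row-number index returns the same rows as a vector scan, here is a small self-contained sketch. The helper make_row_index is hypothetical: it follows the same idea as create_index but takes column names as strings, so it does not rely on data.table internals. The table and the column names v and row are assumptions as well.

```r
library(data.table)

set.seed(1)
# Small stand-in for the question's table; the column name v is an assumption.
DT = data.table(x = sample(letters, 1e4, TRUE),
                y = sample(1:25, 1e4, TRUE),
                v = rnorm(1e4))
setkey(DT, x, y)

# Hypothetical helper: same idea as create_index, with column names given as strings.
make_row_index = function(dt, cols) {
  idx = dt[, cols, with = FALSE]     # copy only the indexed columns
  idx[, row := seq_len(nrow(dt))]    # remember each row's position in dt
  setkeyv(idx, cols)                 # sort the index so it can be binary searched
  idx
}

IDX_y = make_row_index(DT, "y")

# Row numbers in DT where y == 25, found by binary search on the index.
rows = IDX_y[J(25L), nomatch = 0]$row

# The rows fetched through the index should match the vector scan.
identical(DT[rows]$v, DT[y == 25L]$v)
```

The index is a second, smaller table, so it has to be rebuilt whenever DT is reordered or rows are added or removed.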

answered Oct 27 '22 07:10

unique2
