R: data.table subsetting based on an integer column

Just wondering if there is a slicker way to subset a data.table. Basically I have a big table with a million-ish rows and hundreds of columns, and I want to subset it based on an integer column (or columns) having a value within a range I define.

I was wondering: if I set the relevant column as the key, lookups would use binary search, but I'm not sure whether I can then find the rows whose key falls between a range of values.

Contrived example below.

> library(data.table)
> n = 1e7
> dt <- data.table(a=rnorm(n),b=sample(letters,replace=T,n))
> system.time(subset(dt, a > 1 & a < 2))
   user  system elapsed 
  1.596   0.000   1.596
> system.time(dt[a %between% c(1,2)])
   user  system elapsed 
  1.168   0.000   1.168 

can something like this be done?

setkey(dt,a)
dt[...]   # get me the rows whose key value is between 1 and 2

Thanks! -Abhi

asked Dec 16 '13 by Abhi




2 Answers

If you do set the key on a (which will take some time: 14.7 seconds on my machine for n = 1e7), then you can use rolling joins to identify the start and end of your region of interest.

# thus the following will work. 
dt[seq.int(dt[.(1),.I,roll=-1]$.I, dt[.(2), .I, roll=1]$.I)]
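
To unpack that one-liner, here is the same expression split into steps, with comments on what each piece does (purely a restatement of the line above, assuming setkey(dt, a) has already been run):

# first row at or just above a == 1 (roll = -1 rolls to the next observation,
# within a distance of 1, when there is no exact match)
start <- dt[.(1), .I, roll = -1]$.I
# last row at or just below a == 2 (roll = 1 rolls to the previous observation,
# within a distance of 1)
end <- dt[.(2), .I, roll = 1]$.I
# the key is sorted, so the rows in between are exactly the range of interest
dt[seq.int(start, end)]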


library(data.table)
n = 1e7
dt <- data.table(a=rnorm(n),b=sample(letters,replace=T,n))
system.time(setkey(dt,a))
#  This  does take some time
# user  system elapsed 
# 14.72    0.00   14.73
library(microbenchmark)
f1 <- function() t1 <- dt[floor(a) == 1]
f2 <-  function() t2 <- dt[a >= 1 & a <= 2]
f3 <- function() {t3 <- dt[seq.int(dt[.(1),.I,roll=-1]$.I, dt[.(2), .I, roll=1]$.I)]   }
microbenchmark(f1(),f2(),f3(), times=10)
# Unit: milliseconds
#  expr       min        lq    median        uq       max neval
#  f1() 371.62161 387.81815 394.92153 403.52299 489.61508    10
#  f2() 529.62952 536.23727 544.74470 631.55594 634.92275    10
#  f3()  65.58094  66.34703  67.04747  75.89296  89.10182    10

It is now "fast", but only because we spent the time earlier setting the key.

Adding @eddi's approach for benchmarking

f4 <- function(tolerance = 1e-7) {  # adjust the tolerance according to your needs
  start = dt[J(1 + tolerance), .I[1], roll = -Inf]$V1
  end   = dt[J(2 - tolerance), .I[.N], roll = Inf]$V1
  if (start <= end) dt[start:end]
}
microbenchmark(f1(), f2(), f3(), f4(), times = 10)
# Unit: milliseconds
#  expr      min        lq    median        uq       max neval
#  f1() 373.3313 391.07479 440.07025 488.54020 491.48141    10
#  f2() 523.2319 530.11218 533.57844 536.67767 629.53779    10
#  f3()  65.6238  65.71617  66.09967  66.56768  83.27646    10
#  f4()  65.8511  66.26432  66.62096  83.86476  87.01092    10

@eddi's approach is slightly safer, as it takes care of floating point tolerance.
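
The floating point concern is the usual one: a value meant to be exactly 1 or 2 may be stored a rounding error away from it, so exact comparisons against the bounds can silently miss it. A tiny generic illustration (not from the answers):

x <- 1 + 1e-12        # stored just above 1, though it prints as 1
x == 1                # FALSE: an exact comparison misses it
abs(x - 1) < 1e-7     # TRUE: a tolerance-based comparison treats it as 1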

answered by mnel


Doing a setkey here would be costly (even if you were to use the fast ordering in 1.8.11), because it has to move the data (by reference) as well.

However, you can get around this by using the floor function. Basically, if you want all the numbers in [1, 2] (note: inclusive of 1 and 2 here), then floor maps essentially all of them to 1 (strictly speaking floor(2) is 2, but with continuous data like rnorm an exact 2 essentially never occurs). That is, you can do:

system.time(t1 <- dt[floor(a) == 1])
#   user  system elapsed 
#  0.234   0.001   0.238 

This is equivalent to doing dt[a >= 1 & a <= 2] and is twice as fast.

system.time(t2 <- dt[a >= 1 & a <= 2])
#   user  system elapsed 
#  0.518   0.081   0.601 

identical(t1,t2) # [1] TRUE

However, since you don't want the endpoints included, you can use a hack: subtract the tolerance, .Machine$double.eps^0.5, from column a. If a value lies in the range [1, 1 + tolerance), it is still considered to be 1; anything just above that is no longer 1 (internally). In other words, the tolerance is the smallest offset above 1 that the machine can distinguish from 1. So, if you subtract the tolerance from a, all numbers that are internally represented as "1" become < 1, and floor(.) maps them to 0. You therefore get the range > 1 and < 2 instead. That is,

dt[floor(a-.Machine$double.eps^0.5)==1]

will give a result equivalent to dt[a > 1 & a < 2].
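
As a quick sanity check of the lower boundary (a small illustration, not part of the original answer):

tol <- .Machine$double.eps^0.5
floor(1 - tol)      # 0: a value of exactly 1 drops out of the "== 1" bucket, i.e. a > 1
floor(1.5 - tol)    # 1: values strictly inside (1, 2) stay in the bucket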


If you have to do this repeatedly, then creating a new column with this floor trick and setting the key on that integer column could help:

dt[, fa := as.integer(floor(a-.Machine$double.eps^0.5))]
system.time(setkey(dt, fa)) # v1.8.11
#   user  system elapsed 
#  0.852   0.158   1.043 

Now, you can query whatever range you want using binary search:

> system.time(dt[J(1L)])    # equivalent to > 1 & < 2
#   user  system elapsed 
#  0.071   0.002   0.076 
> system.time(dt[J(1:4)])   # equivalent to > 1 & < 5
#   user  system elapsed 
#  0.082   0.002   0.085 
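
The same keyed column works for any other integer range; for example (hypothetical bounds, with nomatch = 0L dropping empty buckets instead of returning NA rows):

> system.time(dt[J(-1:0)])              # equivalent to > -1 & < 1
> system.time(dt[J(3:6), nomatch = 0L]) # equivalent to > 3 & < 7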

answered by Arun