R: data.table subsetting based on an integer column

Just wondering if there is a slicker way to subset a data.table. Basically I have a big table with a million-ish rows and hundreds of columns, and I want to subset it based on an integer column (or columns) having a value within a range I define.

I was wondering: if I set the relevant column as the key, lookups would use binary search, but I'm not sure whether I can then find the rows whose key falls between a range of values.

Contrived example below.

> library(data.table)
> n = 1e7
> dt <- data.table(a=rnorm(n),b=sample(letters,replace=T,n))
> system.time(subset(dt, a > 1 & a < 2))
   user  system elapsed 
  1.596   0.000   1.596
> system.time(dt[a %between% c(1,2)])
   user  system elapsed 
  1.168   0.000   1.168 

can something like this be done?

setkey(dt,a)
dt[...]   # get me the rows whose key value is between 1 and 2

Thanks! -Abhi

asked Dec 16 '13 by Abhi




2 Answers

If you do set the key on a (which will take some time: 14.7 seconds on my machine for n = 1e7), then you can use rolling joins to identify the start and end of your region of interest.

# thus the following will work. 
dt[seq.int(dt[.(1),.I,roll=-1]$.I, dt[.(2), .I, roll=1]$.I)]
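
To unpack that one-liner, here is the same expression split into steps, with comments on what each piece does (purely a restatement of the line above, assuming setkey(dt, a) has already been run):

# first row at or just above a == 1 (roll = -1 rolls to the next observation,
# within a distance of 1, when there is no exact match)
start <- dt[.(1), .I, roll = -1]$.I
# last row at or just below a == 2 (roll = 1 rolls to the previous observation,
# within a distance of 1)
end <- dt[.(2), .I, roll = 1]$.I
# the key is sorted, so the rows in between are exactly the range of interest
dt[seq.int(start, end)]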


library(data.table)
n = 1e7
dt <- data.table(a=rnorm(n),b=sample(letters,replace=T,n))
system.time(setkey(dt,a))
#  This  does take some time
# user  system elapsed 
# 14.72    0.00   14.73
library(microbenchmark)
f1 <- function() t1 <- dt[floor(a) == 1]
f2 <-  function() t2 <- dt[a >= 1 & a <= 2]
f3 <- function() {t3 <- dt[seq.int(dt[.(1),.I,roll=-1]$.I, dt[.(2), .I, roll=1]$.I)]   }
microbenchmark(f1(),f2(),f3(), times=10)
# Unit: milliseconds
#  expr       min        lq    median        uq       max neval
#  f1() 371.62161 387.81815 394.92153 403.52299 489.61508    10
#  f2() 529.62952 536.23727 544.74470 631.55594 634.92275    10
#  f3()  65.58094  66.34703  67.04747  75.89296  89.10182    10

It is now "fast", but only because we spent the time earlier setting the key.

Adding @eddi's approach for benchmarking

f4 <- function(tolerance = 1e-7) {  # adjust the tolerance according to your needs
  start = dt[J(1 + tolerance), .I[1], roll = -Inf]$V1
  end   = dt[J(2 - tolerance), .I[.N], roll = Inf]$V1
  if (start <= end) dt[start:end]
}
microbenchmark(f1(), f2(), f3(), f4(), times = 10)
# Unit: milliseconds
#  expr      min        lq    median        uq       max neval
#  f1() 373.3313 391.07479 440.07025 488.54020 491.48141    10
#  f2() 523.2319 530.11218 533.57844 536.67767 629.53779    10
#  f3()  65.6238  65.71617  66.09967  66.56768  83.27646    10
#  f4()  65.8511  66.26432  66.62096  83.86476  87.01092    10

@eddi's approach is slightly safer, as it takes care of floating point tolerance.
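
The floating point concern is the usual one: a value meant to be exactly 1 or 2 may be stored a rounding error away from it, so exact comparisons against the bounds can silently miss it. A tiny generic illustration (not from the answers):

x <- 1 + 1e-12        # stored just above 1, though it prints as 1
x == 1                # FALSE: an exact comparison misses it
abs(x - 1) < 1e-7     # TRUE: a tolerance-based comparison treats it as 1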

answered by mnel


Doing a setkey here would be costly (even if you were to use the fast ordering in 1.8.11), because it has to move the data (by reference) as well.

However, you can get around this by using the floor function. Basically, if you want all the numbers in [1, 2] (note: inclusive of 1 and 2 here), then floor maps essentially all of them to 1 (strictly speaking floor(2) is 2, but with continuous data like rnorm an exact 2 essentially never occurs). That is, you can do:

system.time(t1 <- dt[floor(a) == 1])
#   user  system elapsed 
#  0.234   0.001   0.238 

This is equivalent to doing dt[a >= 1 & a <= 2] and is twice as fast.

system.time(t2 <- dt[a >= 1 & a <= 2])
#   user  system elapsed 
#  0.518   0.081   0.601 

identical(t1,t2) # [1] TRUE

However, since you don't want the endpoints included, you can use a hack: subtract the tolerance, .Machine$double.eps^0.5, from column a. If a value lies in the range [1, 1 + tolerance), it is still considered to be 1; anything just above that is no longer 1 (internally). In other words, the tolerance is the smallest offset above 1 that the machine can distinguish from 1. So, if you subtract the tolerance from a, all numbers that are internally represented as "1" become < 1, and floor(.) maps them to 0. You therefore get the range > 1 and < 2 instead. That is,

dt[floor(a-.Machine$double.eps^0.5)==1]

will give a result equivalent to dt[a > 1 & a < 2].
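
As a quick sanity check of the lower boundary (a small illustration, not part of the original answer):

tol <- .Machine$double.eps^0.5
floor(1 - tol)      # 0: a value of exactly 1 drops out of the "== 1" bucket, i.e. a > 1
floor(1.5 - tol)    # 1: values strictly inside (1, 2) stay in the bucket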


If you have to do this repeatedly, then creating a new column with this floor trick and setting the key on that integer column could help:

dt[, fa := as.integer(floor(a-.Machine$double.eps^0.5))]
system.time(setkey(dt, fa)) # v1.8.11
#   user  system elapsed 
#  0.852   0.158   1.043 

Now, you can query whatever range you want using binary search:

> system.time(dt[J(1L)])    # equivalent to > 1 & < 2
#   user  system elapsed 
#  0.071   0.002   0.076 
> system.time(dt[J(1:4)])   # equivalent to > 1 & < 5
#   user  system elapsed 
#  0.082   0.002   0.085 
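
The same keyed column works for any other integer range; for example (hypothetical bounds, with nomatch = 0L dropping empty buckets instead of returning NA rows):

> system.time(dt[J(-1:0)])              # equivalent to > -1 & < 1
> system.time(dt[J(3:6), nomatch = 0L]) # equivalent to > 3 & < 7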

answered by Arun