Just wondering if there is a slicker way to subset a data.table. Basically, I have a big table with a million or so rows and hundreds of columns. I want to subset it based on one or more integer columns having a value within a range that I define.
I was wondering whether, if I set the relevant column as the key, the lookup would use binary search, but I'm not sure whether I can then find the rows within a range of values.
Contrived example below.
> library(data.table)
> n = 1e7
> dt <- data.table(a=rnorm(n),b=sample(letters,replace=TRUE,n))
> system.time(subset(dt, a > 1 & a < 2))
user system elapsed
1.596 0.000 1.596
> system.time(dt[a %between% c(1,2)])
user system elapsed
1.168 0.000 1.168
Can something like this be done?
setkey(dt,a)
dt[ ]  # get me the rows whose key values are between 1 and 2
Thanks! -Abhi
If you do set the key on a (which will take some time: 14.7 seconds on my machine for n = 1e7), then you can use rolling joins to identify the start and end of your region of interest.
# thus the following will work.
dt[seq.int(dt[.(1),.I,roll=-1]$.I, dt[.(2), .I, roll=1]$.I)]
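A minimal sketch of the same idea on a five-row table, so the indices can be checked by hand. It asks for row numbers with which = TRUE rather than going through .I, and uses unbounded rolls. The toy values and the names dt_small, start and end are invented for illustration, and both endpoints are treated as inclusive.

```r
library(data.table)

# Toy keyed table (values invented for illustration)
dt_small <- data.table(a = c(0.5, 1.2, 1.7, 2.4, 3.1), b = letters[1:5])
setkey(dt_small, a)

# roll = -Inf carries the next observation backward: first row with a >= 1
start <- dt_small[.(1), roll = -Inf, which = TRUE, mult = "first"]
# roll = Inf carries the last observation forward: last row with a <= 2
end   <- dt_small[.(2), roll = Inf, which = TRUE, mult = "last"]

# Guard against an empty or out-of-range result before slicing
res <- if (!is.na(start) && !is.na(end) && start <= end) {
  dt_small[start:end]
} else {
  dt_small[0]
}
res$a
```

Both lookups are binary searches on the key, so the cost of the subset is dominated by copying the selected rows.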
n = 1e7
dt <- data.table(a=rnorm(n),b=sample(letters,replace=T,n))
system.time(setkey(dt,a))
# This does take some time
# user system elapsed
# 14.72 0.00 14.73
library(microbenchmark)
f1 <- function() t1 <- dt[floor(a) == 1]
f2 <- function() t2 <- dt[a >= 1 & a <= 2]
f3 <- function() {t3 <- dt[seq.int(dt[.(1),.I,roll=-1]$.I, dt[.(2), .I, roll=1]$.I)] }
microbenchmark(f1(),f2(),f3(), times=10)
# Unit: milliseconds
# expr min lq median uq max neval
# f1() 371.62161 387.81815 394.92153 403.52299 489.61508 10
# f2() 529.62952 536.23727 544.74470 631.55594 634.92275 10
# f3() 65.58094 66.34703 67.04747 75.89296 89.10182 10
The subset is now fast, but only because we spent time earlier setting the key.
Adding @eddi's approach for benchmarking
f4 <- function(tolerance = 1e-7){ # adjust according to your needs
start = dt[J(1 + tolerance), .I[1], roll = -Inf]$V1
end = dt[J(2 - tolerance), .I[.N], roll = Inf]$V1
if (start <= end) dt[start:end]}
microbenchmark(f1(),f2(),f3(),f4(), times=10)
# Unit: milliseconds
# expr min lq median uq max neval
# f1() 373.3313 391.07479 440.07025 488.54020 491.48141 10
# f2() 523.2319 530.11218 533.57844 536.67767 629.53779 10
# f3() 65.6238 65.71617 66.09967 66.56768 83.27646 10
# f4() 65.8511 66.26432 66.62096 83.86476 87.01092 10
Eddi's approach is slightly safer as it takes care of floating point tolerance.
Doing a setkey here would be costly (even if you were to use the fast ordering in 1.8.11), because it has to move the data (by reference) as well.
However, you can get around this by using the floor function. Basically, if you want all the numbers in [1, 2) (note: 1 is included but 2 is not, since floor(2) is 2), then floor returns 1 for all of these values. That is, you can do:
system.time(t1 <- dt[floor(a) == 1])
# user system elapsed
# 0.234 0.001 0.238
For this data it gives the same result as dt[a >= 1 & a <= 2] (the two differ only for values exactly equal to 2, which do not occur here) and is about twice as fast.
system.time(t2 <- dt[a >= 1 & a <= 2])
# user system elapsed
# 0.518 0.081 0.601
identical(t1,t2) # [1] TRUE
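One boundary case is worth checking before relying on the floor shortcut for other data. The sketch below uses a few hand-picked values (not the benchmark data) to show where the two filters can disagree.

```r
# Boundary check on hand-picked values (not from the benchmark data)
a <- c(0.99, 1, 1.5, 2, 2.01)

by_floor   <- floor(a) == 1      # TRUE for 1 <= a < 2
by_compare <- a >= 1 & a <= 2    # TRUE for 1 <= a <= 2

# The two filters disagree only where a is exactly 2
a[by_floor != by_compare]
```

With continuous random data such as rnorm output, an exact 2 is vanishingly unlikely, which is why identical() reports TRUE above.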
However, since you don't want the equality, you can use a hack: subtract tolerance = .Machine$double.eps^0.5 from column a. Any value in the range [1, 1+tolerance) is then treated as equal to 1. (This is the same tolerance all.equal uses by default; it is much coarser than the machine's actual resolution near 1, which is .Machine$double.eps.) Subtracting the tolerance from a pushes every such value below 1, so floor(.) returns 0 for it, and you get the range > 1 and < 2 instead. That is,
dt[floor(a-.Machine$double.eps^0.5)==1]
will give the same result as dt[a>1 & a<2], up to the tolerance at both ends: values less than tolerance above 1 are dropped, and values in [2, 2+tolerance) are kept.
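To see what the shifted filter does at the edges, here is a small sketch on hand-picked values. The vector a and the names tol, shifted and strict are illustrative only.

```r
tol <- .Machine$double.eps^0.5   # about 1.5e-8
a <- c(1, 1 + 1e-9, 1.5, 2, 2.5)

shifted <- floor(a - tol) == 1   # the tolerance-shifted floor filter
strict  <- a > 1 & a < 2         # the strict comparison it approximates

# Side-by-side view; rows where the columns differ are the edge cases
data.frame(a = a, shifted = shifted, strict = strict)
```

The two columns differ only within tol of the endpoints: 1 + 1e-9 is dropped by the shifted filter, and an exact 2 is kept by it.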
If you have to do this repeatedly, then creating a new column with this floor function and setting the key on that integer column could help:
dt[, fa := as.integer(floor(a-.Machine$double.eps^0.5))]
system.time(setkey(dt, fa)) # v1.8.11
# user system elapsed
# 0.852 0.158 1.043
Now, you can query whatever range you want using binary search:
> system.time(dt[J(1L)]) # equivalent to > 1 & < 2
# user system elapsed
# 0.071 0.002 0.076
> system.time(dt[J(1:4)]) # equivalent to > 1 & < 5
# user system elapsed
# 0.082 0.002 0.085
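The keyed-bucket lookup can be checked by hand on a tiny table. This is a sketch under simplifying assumptions: it uses plain floor(a) without the tolerance shift, the values are invented, and dt_small and hits are illustrative names.

```r
library(data.table)

# Toy table (values invented for illustration)
dt_small <- data.table(a = c(0.2, 1.1, 1.9, 2.5, 4.7, 5.3))
dt_small[, fa := as.integer(floor(a))]   # integer bucket for each value
setkey(dt_small, fa)

# Binary search on the integer key; nomatch = 0L drops buckets
# that contain no rows (bucket 3 in this toy data)
hits <- dt_small[.(1:4), nomatch = 0L]
hits$a
```

Without nomatch = 0L, an empty bucket would come back as a row of NA values, which is easy to miss when counting results.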