
subsetting in data.table

Tags:

r

data.table

subset

I am trying to subset a data.table (from the data.table package) in R, not a data.frame. I have a four-digit year as the key, and I would like to subset by a series of years. For example, I want to pull all the records from 1999, 2000 and 2001.

I have tried passing the following to the DT[J(year)] binary search syntax:

1999,2000,2001
c(1999,2000,2001)
1999, 2000, 2001

but none of these seem to work. Does anyone know how to subset when the years you want to select are several rather than just one?

678

asked Mar 30 '11 13:03

exl

2 Answers

What works for data.frames works for data.tables.

subset(DT, year %in% 1999:2001)
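For reference, here is a minimal sketch of the same filter on a small made-up table. The `DT` below and its `value` column are hypothetical, not the asker's data. It also shows that the `%in%` condition can be written directly inside `[`, where column names are looked up within the table.

```r
library(data.table)

# Hypothetical example data: one row per year, 1995 to 2005
DT <- data.table(year = 1995:2005, value = seq_len(11))

# subset() with %in%, as in the answer above
res_subset <- subset(DT, year %in% 1999:2001)

# Same filter written inside [ ]; `year` is resolved as a column of DT
res_bracket <- DT[year %in% 1999:2001]
```

Both calls do a vector scan, so neither needs a key to be set.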

195

answered Oct 16 '22 02:10

Richie Cotton

The question is not clear and does not provide sufficient data to work with, but it is useful, so anyone who wants to edit it with the data I provide below is welcome to do so. The title of the post could also be made more complete: Matthew Dowle often answers the question of subsetting over two vectors, but less frequently the question of subsetting by an %in% condition on a single vector. I had been looking for an answer for a while until I found one for character vectors here.

Let's consider this data:

library(data.table)
n <- 100
X <- data.table(a=sample(c(10,20,25,30,40),n,replace=TRUE),b=1:n)

The data.table-style query corresponding to X[X$a %in% c(10,20),] is somewhat surprising:

setkey(X,a)
X[.(c(10,20))]
X[.(10,20)] # works for characters but not for integers
            # instead, treats 10 as the filter
            # and 20 as a new variable

# for comparison :
X[X$a %in% c(10,20),]
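To make the comparison concrete, here is a small sketch on made-up data (the six-row `X` below is hypothetical). Since `.()` is an alias for `list()`, each of its arguments becomes a separate column of the lookup table, so values meant for the same key column go together in one `c()`.

```r
library(data.table)

# Small hypothetical table with a numeric key column
X <- data.table(a = c(10, 20, 25, 30, 10, 20), b = 1:6)
setkey(X, a)  # sorts X by a and marks a as the key

# One argument holding both values: rows where a is 10 or 20
both <- X[.(c(10, 20))]

# Vector scan for comparison; selects the same rows
scan <- X[a %in% c(10, 20)]
```

Because `X` is already sorted by its key, the two results list the matching rows in the same order here.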

Now, which is best? If your key is already set, data.table, obviously. Otherwise it may not be, as the following timings show (on my computer with 1.75 GB of RAM):

n <- 1e7
X <- data.table(a=sample(c(10,20,25,30,40),n,replace=TRUE),b=1:n)
system.time(X[X$a %in% c(10,20),])
# user  system elapsed   (shown as "utilisateur système écoulé"; yes, I'm French)
# 1.92    0.06    1.99
system.time(setkey(X,a))
#  user  system elapsed
# 34.91    0.05   35.23
system.time(X[J(c(10,20))])
# user  system elapsed
# 0.15    0.08    0.23
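As a side note, later data.table releases also accept an `on=` argument, which joins on a column without calling `setkey` first. The sketch below assumes a version that supports `on=` (1.9.6 or later, to my knowledge) and uses a small made-up table. It is not a benchmark, and it says nothing about how the timings above would change.

```r
library(data.table)

set.seed(1)  # only so the hypothetical data are reproducible
n <- 1000
X <- data.table(a = sample(c(10, 20, 25, 30, 40), n, replace = TRUE), b = 1:n)

# Join on column a without setting a key; X itself is left unsorted
res_on <- X[.(c(10, 20)), on = "a"]

# Vector scan for comparison
res_scan <- X[a %in% c(10, 20)]
```

The join returns the rows grouped by lookup value (all the 10s, then all the 20s), whereas the vector scan keeps the original row order, so the two results contain the same rows in a different order.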

But maybe Matthew has better solutions...

[Matthew] You've discovered that sorting type numeric (a.k.a. double) is much slower than integer. For many years we didn't allow double in keys for fear of users falling into this trap and reporting terrible timings like this. We allowed double in keys with some trepidation because fast sorting isn't implemented for double yet. Fast sorting on integer and character is pretty good because those are done using a counting sort. ~~Hopefully we'll get to fast sorting numeric one day!~~ (Now implemented - see below).

Timings on data.table pre-1.9.0

n <- 1e7
X <- data.table(a=sample(c(10,20,25,30,40),n,replace=TRUE),b=1:n)      
system.time(setkey(X,a))
#   user  system elapsed 
# 13.898   0.138  14.216 

X <- data.table(a=sample(as.integer(c(10,20,25,30,40)),n,replace=TRUE),b=1:n)
system.time(setkey(X,a))
#   user  system elapsed 
#  0.381   0.019   0.408

Remember that 2 is type numeric in R by default, whereas 2L is integer. Although data.table accepts numeric, it still much prefers integer.
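A short sketch of what that means in practice, on a hypothetical five-row table: check the type of the literals, convert the key column to integer by reference, then key and query it with integer values.

```r
library(data.table)

# A bare number literal is double; the L suffix makes it integer
stopifnot(is.double(2), is.integer(2L))

# Hypothetical table whose key column was created as double
X <- data.table(a = c(10, 20, 25, 30, 40), b = 1:5)

# Convert the column to integer by reference, then set the key
X[, a := as.integer(a)]
setkey(X, a)

# Look up with integer values as well
res <- X[.(c(10L, 20L))]
```

The conversion is only safe when the column really holds whole numbers, as years do; `as.integer()` truncates any fractional part.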

Fast radix sort for numerics has been implemented since v1.9.0.

From v1.9.0 on

n <- 1e7
X <- data.table(a=sample(c(10,20,25,30,40),n,replace=TRUE),b=1:n)      
system.time(setkey(X,a))
#    user  system elapsed 
#   0.832   0.026   0.871

answered Oct 16 '22 02:10

Arthur
