I am trying to subset a data.table (from the data.table package, not a data.frame) in R. I have a 4-digit year as the key and would like to subset by a series of years. For example, I want to pull all the records from 1999, 2000, and 2001.
I have tried passing the following into the DT[J(year)] binary search syntax:
1999,2000,2001
c(1999,2000,2001)
1999, 2000, 2001
but none of these seem to work. Anyone know how to do a subset where the years you want to select are not just 1 but multiple years?
What works for data.frames works for data.tables.
subset(DT, year %in% 1999:2001)
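As a minimal sketch of how this maps onto the question's keyed lookup: the table below is made up for illustration (the `value` column and the year range are assumptions, not the asker's data). It wraps the years in a single vector inside `.()`, so they are matched against the one key column, and compares that with the `%in%` vector scan.

```r
library(data.table)

# Hypothetical data: two records per year, with an integer 4-digit year
DT <- data.table(year = rep(1997:2003, each = 2L), value = 1:14)
setkey(DT, year)

wanted <- 1999:2001                # the years to keep, as ONE vector

by_key  <- DT[.(wanted)]           # keyed (binary search) lookup
by_scan <- DT[year %in% wanted]    # vector scan, works without a key

# Both approaches select the same rows
identical(by_key$value, by_scan$value)
```

Passing the years as separate arguments, as in `DT[.(1999, 2000, 2001)]`, would instead be read as three columns of `i`, which is likely why the attempts in the question did not behave as expected.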
The question is not clear and does not provide sufficient data to work with, but it is useful, so if someone can edit it with the data I provide below, they are welcome to. The title of the post could also be completed: Matthew Dowle often answers the subsetting-over-two-vectors question, but less frequently the question of subsetting with an %in% statement on one vector. I looked for an answer for a while, until I found one for character vectors here.
Let's consider this data :
library(data.table)
n <- 100
X <- data.table(a=sample(c(10,20,25,30,40),n,replace=TRUE),b=1:n)
The data.table-style query corresponding to X[X$a %in% c(10,20),] is somewhat surprising:
setkey(X,a)
X[.(c(10,20))]
X[.(10,20)] # works for characters but not for integers
# instead, treats 10 as the filter
# and 20 as a new variable
# for comparison :
X[X$a %in% c(10,20),]
Now, which is best? If your key is already set, data.table, obviously. Otherwise it might not be, as the following timings show (on my computer with 1.75 GB of RAM):
n <- 1e7
X <- data.table(a=sample(c(10,20,25,30,40),n,replace=TRUE),b=1:n)
system.time(X[X$a %in% c(10,20),])
# utilisateur système écoulé (yes, I'm French)
# 1.92 0.06 1.99
system.time(setkey(X,a))
# utilisateur système écoulé
# 34.91 0.05 35.23
system.time(X[J(c(10,20))])
# utilisateur système écoulé
# 0.15 0.08 0.23
But maybe Matthew has better solutions...
[Matthew] You've discovered that sorting type numeric (a.k.a. double) is much slower than integer. For many years we didn't allow double in keys for fear of users falling into this trap and reporting terrible timings like this. We allowed double in keys with some trepidation because fast sorting isn't implemented for double yet. Fast sorting on integer and character is pretty good because those are done using a counting sort. Hopefully we'll get to fast sorting numeric one day! (Now implemented - see below.)
n <- 1e7
X <- data.table(a=sample(c(10,20,25,30,40),n,replace=TRUE),b=1:n)
system.time(setkey(X,a))
# user system elapsed
# 13.898 0.138 14.216
X <- data.table(a=sample(as.integer(c(10,20,25,30,40)),n,replace=TRUE),b=1:n)
system.time(setkey(X,a))
# user system elapsed
# 0.381 0.019 0.408
Remember that 2 is type numeric in R by default. 2L is integer. Although data.table accepts numeric, it still much prefers integer.
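A small sketch of that point, reusing the `X` table from above at a smaller size: it checks the types of `2` and `2L`, then converts the key column to integer before calling `setkey`. The seed and the conversion step are additions for illustration, and the conversion is only safe here because the values are whole numbers.

```r
library(data.table)

typeof(2)     # "double"  (R's default numeric type)
typeof(2L)    # "integer"

set.seed(1)   # only to make the sketch reproducible
n <- 100
X <- data.table(a = sample(c(10, 20, 25, 30, 40), n, replace = TRUE), b = 1:n)

X[, a := as.integer(a)]   # convert the column in place, by reference
setkey(X, a)              # the key is now an integer column

res <- X[.(c(10L, 20L))]  # look up with integer values as well
```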
Fast radix sort for numerics has been implemented since v1.9.0.
n <- 1e7
X <- data.table(a=sample(c(10,20,25,30,40),n,replace=TRUE),b=1:n)
system.time(setkey(X,a))
# user system elapsed
# 0.832 0.026 0.871