I'm a bit surprised by how data.table
works:
> library(data.table)
data.table 1.8.2 For help type: help("data.table")
> dt <- data.table(a=11:20, b=21:30, c=31:40, key="a")
> dt[list(12)]
a b c
1: 12 22 32
> dt[list(12), b]
a b
1: 12 22
> dt[list(12)][,b]
[1] 22
What I'm trying to do is obtain the value of a single column (or expression) in rows matched by a selection. I see that I've got to pass the key as a list as a raw number would indicate a row number and not a key value. So the first of the above is clear to me. But why the second and the thirs subset expression yield different results seems rather confusing to me. I'd like to get the third result, but would exect being able to write it the second way.
Is there any good reason why subsetting a data.table for rows and columns at the same time will always include the key value as well as the computed result? Is there a syntactically shorter way to obtain a single result except by double subsetting as above?
I'm using data.table 1.8.2 on R 2.15.1. If you cannot reproduce my example, you might as well consider a factor as key:
dt <- data.table(a=paste("a", 11:20, sep=""), b=21:30, c=31:40, key="a")
dt["a11", b]
Regarding this question:
Is there any good reason why subsetting a data.table for rows and columns at the same time will always include the key value as well as the computed result?
I believe that the (good enough for me) reason is simply that Matthew Dowle hasn't yet gotten around to adding that option (likely because he has prioritized work on much more useful features such as ":= with by").
In comments following my answer here, Matthew seemed to indicate that it is on his TODO list, noting that "[this] is what drop=TRUE
will do (with a speed advantage) when drop
is added".
Until then, any of the following will get the job done:
dt[list(12)][,b]
# [1] 22
dt[list(12)][[2]]
# [1] 22
dt[dt[list(12), which=TRUE], b]
# [1] 22
One possibility is to use:
dt[a == 12]
and
dt[a == 12, b]
This will work as expected, but it prevents binary search and requires sequential search instead (is there a plan to change this behavior ??), making it potentially slower.
UPDATE Sep 2014: now in v1.9.3
From NEWS :
DT[column==values]
is now optimized to useDT
's key whenkey(DT)[1]=="column"
, otherwise a secondary key (a.k.a. index) is automatically added so the nextDT[column==values]
is much faster.DT[column %in% values]
is equivalent; i.e., both==
and%in%
accept vector values. No code changes are needed; existing code should automatically benefit. Secondary keys can be added manually usingset2key()
and existence checked usingkey2()
. These optimizations and function names/arguments are experimental and may be turned off withoptions(datatable.auto.index=FALSE)
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With