It seems to me the fastest way to do a row/col subset of a data.table
is to use the join and nomatch
option.
Is this correct?
DT = data.table(rep(1:100, 100000), rep(1:10, 1000000))
setkey(DT, V1, V2)
system.time(DT[J(22,2), nomatch=0L])
# user system elapsed
# 0.00 0.00 0.01
system.time(subset(DT, (V1==22) & (V2==2)))
# user system elapsed
# 0.45 0.21 0.67
identical(DT[J(22,2), nomatch=0L],subset(DT, (V1==22) & (V2==2)))
# [1] TRUE
I also have one problem with the fast join based on binary search: I cannot find a way to select all items in one dimension.
Say if I want to subsequently do:
DT[J(22,2), nomatch=0] # subset on TWO dimensions
DT[J(22,), nomatch=0] # subset on ONE dimension only
# Error in list(22, ) : argument 2 is empty
without having to re-set the key to only one dimension (because I am in a loop and I don't want to rest the keys every time).
data. table(DT) is TRUE. To better description, I put parts of my original code here. So you may understand where goes wrong.
data.table
?Using the binary search based subset feature is the fastest. Note that the subset requires the option nomatch = 0L
so as to return only the matching results.
If you've two keys set on DT
and you want to subset by the first key, then you can just provide the first value in J(.)
, no need to provide anything for the 2nd key. That is:
# will return all columns where the first key column matches 22
DT[J(22), nomatch=0L]
If instead, you would like to subset by the second key, then you'll have to, as of now, provide all the unique values for the first key. That is:
# will return all columns where 2nd key column matches 2
DT[J(unique(V1), 2), nomatch=0L]
This is also shown in this SO post. Although I'd prefer that DT[J(, 2)]
to work for this case, as that seems rather intuitive.
There's also a pending feature request, FR #1007 for implementing secondary keys, which when done would take care of this.
Here is a better example:
DT = data.table(c(1,2,3,4,5), c(2,3,2,3,2))
DT
# V1 V2
# 1: 1 2
# 2: 2 3
# 3: 3 2
# 4: 4 3
# 5: 5 2
setkey(DT,V1,V2)
DT[J(unique(V1),2)]
# V1 V2
# 1: 1 2
# 2: 2 2
# 3: 3 2
# 4: 4 2
# 5: 5 2
DT[J(unique(V1),2), nomatch=0L]
# V1 V2
# 1: 1 2
# 2: 3 2
# 3: 5 2
DT[J(3), nomatch=0L]
# V1 V2
# 1: 3 2
In summary:
# key(DT) = c("V1", "V2")
# data.frame | data.table equivalent
# =====================================================================
# subset(DF, (V1 == 3) & (V2 == 2)) | DT[J(3,2), nomatch=0L]
# subset(DF, (V1 == 3)) | DT[J(3), nomatch=0L]
# subset(DF, (V2 == 2)) | DT[J(unique(V1), 2), nomatch=0L]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With