Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's the fastest way to subset a data.table?

Tags:

r

data.table

It seems to me the fastest way to do a row/col subset of a data.table is to use the join and nomatch option.

Is this correct?

DT = data.table(rep(1:100, 100000), rep(1:10, 1000000))
setkey(DT, V1, V2)
system.time(DT[J(22,2), nomatch=0L])
# user  system elapsed 
# 0.00    0.00    0.01 
system.time(subset(DT, (V1==22) & (V2==2)))
# user  system elapsed 
# 0.45    0.21    0.67 

identical(DT[J(22,2), nomatch=0L],subset(DT, (V1==22) & (V2==2)))
# [1] TRUE

I also have one problem with the fast join based on binary search: I cannot find a way to select all items in one dimension.

Say if I want to subsequently do:

DT[J(22,2), nomatch=0]  # subset on TWO dimensions
DT[J(22,), nomatch=0]   # subset on ONE dimension only
# Error in list(22, ) : argument 2 is empty

without having to re-set the key to only one dimension (because I am in a loop and I don't want to rest the keys every time).

like image 225
Timothée HENRY Avatar asked May 20 '14 09:05

Timothée HENRY


People also ask

Is data table DT == true?

data. table(DT) is TRUE. To better description, I put parts of my original code here. So you may understand where goes wrong.


Video Answer


1 Answers

What's the fastest way to subset a data.table?

Using the binary search based subset feature is the fastest. Note that the subset requires the option nomatch = 0L so as to return only the matching results.

How to subset by one of the keys only with two keys set?

If you've two keys set on DT and you want to subset by the first key, then you can just provide the first value in J(.), no need to provide anything for the 2nd key. That is:

# will return all columns where the first key column matches 22
DT[J(22), nomatch=0L] 

If instead, you would like to subset by the second key, then you'll have to, as of now, provide all the unique values for the first key. That is:

# will return all columns where 2nd key column matches 2
DT[J(unique(V1), 2), nomatch=0L]

This is also shown in this SO post. Although I'd prefer that DT[J(, 2)] to work for this case, as that seems rather intuitive.

There's also a pending feature request, FR #1007 for implementing secondary keys, which when done would take care of this.

Here is a better example:

DT = data.table(c(1,2,3,4,5), c(2,3,2,3,2))
DT
#    V1 V2
# 1:  1  2
# 2:  2  3
# 3:  3  2
# 4:  4  3
# 5:  5  2
setkey(DT,V1,V2)
DT[J(unique(V1),2)]
#    V1 V2
# 1:  1  2
# 2:  2  2
# 3:  3  2
# 4:  4  2
# 5:  5  2
DT[J(unique(V1),2), nomatch=0L]
#    V1 V2
# 1:  1  2
# 2:  3  2
# 3:  5  2
DT[J(3), nomatch=0L]
#    V1 V2
# 1:  3  2

In summary:

# key(DT) = c("V1", "V2")

# data.frame                        |             data.table equivalent
# =====================================================================
# subset(DF, (V1 == 3) & (V2 == 2)) |            DT[J(3,2), nomatch=0L]
# subset(DF, (V1 == 3))             |              DT[J(3), nomatch=0L]
# subset(DF, (V2 == 2))             |  DT[J(unique(V1), 2), nomatch=0L]
like image 199
Timothée HENRY Avatar answered Sep 30 '22 14:09

Timothée HENRY