What's the fastest way to subset a data.table?

Tags: r, data.table

It seems to me the fastest way to do a row/col subset of a data.table is to use the join and nomatch option.

Is this correct?

library(data.table)

DT = data.table(rep(1:100, 100000), rep(1:10, 1000000))
setkey(DT, V1, V2)
system.time(DT[J(22,2), nomatch=0L])
# user  system elapsed 
# 0.00    0.00    0.01 
system.time(subset(DT, (V1==22) & (V2==2)))
# user  system elapsed 
# 0.45    0.21    0.67 

identical(DT[J(22,2), nomatch=0L],subset(DT, (V1==22) & (V2==2)))
# [1] TRUE

I also have one problem with the fast join based on binary search: I cannot find a way to select all items in one dimension.

Say if I want to subsequently do:

DT[J(22,2), nomatch=0]  # subset on TWO dimensions
DT[J(22,), nomatch=0]   # subset on ONE dimension only
# Error in list(22, ) : argument 2 is empty

without having to reset the key to only one dimension (I am in a loop and I don't want to reset the keys on every iteration).


asked May 20 '14 09:05

Timothée HENRY

1 Answer

What's the fastest way to subset a `data.table`?

Yes: the binary-search-based subset (a join against the key) is the fastest. Note that you need nomatch = 0L so that only matching rows are returned; with the default, nomatch = NA, every lookup row is returned, including those without a match.
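
To make the effect of nomatch concrete, here is a minimal sketch on a hypothetical toy table. The val column and its contents are assumptions added for illustration and are not part of the original question; they are there so that unmatched lookups show up as NA.

```r
library(data.table)

# Hypothetical toy table: two key columns plus one value column.
DT <- data.table(V1 = c(1, 1, 2, 3),
                 V2 = c(2, 3, 2, 3),
                 val = c("a", "b", "c", "d"))
setkey(DT, V1, V2)

# Default (nomatch = NA): one result row per lookup row.
# The lookup (3, 2) has no match, so its val is NA.
with_na <- DT[J(c(1, 2, 3), 2)]

# nomatch = 0L: lookup rows without a match are dropped.
matched <- DT[J(c(1, 2, 3), 2), nomatch = 0L]

print(with_na)
print(matched)
```

With only key columns in the table, as in the question, the unmatched rows are harder to spot because the key values themselves come from the lookup.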

How to subset by one of the keys only with two keys set?

If you have two key columns set on DT and you want to subset by the first one, supply only the first value in J(.); there is no need to provide anything for the second key. That is:

# will return all rows where the first key column matches 22
DT[J(22), nomatch=0L]

If instead, you would like to subset by the second key, then you'll have to, as of now, provide all the unique values for the first key. That is:

# will return all rows where the 2nd key column matches 2
DT[J(unique(V1), 2), nomatch=0L]

This is also shown in this SO post. That said, I'd prefer DT[J(, 2)] to work for this case, as that seems rather intuitive.

There's also a pending feature request, FR #1007, for implementing secondary keys, which, when done, would take care of this.
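
For readers on a more recent data.table, a possible alternative is the on= argument of `[`, which joins on named columns without touching the key. The sketch below assumes a release that supports on= (1.9.6 or later, as far as I recall); it is not the approach available when this answer was written, and it says nothing about the status of the feature request above.

```r
library(data.table)

DT <- data.table(V1 = c(1, 2, 3, 4, 5), V2 = c(2, 3, 2, 3, 2))
setkey(DT, V1, V2)

# Key-based route from the answer: supply every value of the first key.
by_key <- DT[J(unique(V1), 2), nomatch = 0L]

# Ad hoc join on V2 only. The lookup column is named explicitly so that
# it lines up with on = "V2"; the key of DT is left as it is.
by_on <- DT[J(V2 = 2), on = "V2", nomatch = 0L]

print(by_on)
print(key(DT))
```

Because the key is never changed, this form can be called repeatedly inside a loop alongside the two-column lookups.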

Here is a better example:

DT = data.table(c(1,2,3,4,5), c(2,3,2,3,2))
DT
#    V1 V2
# 1:  1  2
# 2:  2  3
# 3:  3  2
# 4:  4  3
# 5:  5  2
setkey(DT,V1,V2)
DT[J(unique(V1),2)]
#    V1 V2
# 1:  1  2
# 2:  2  2
# 3:  3  2
# 4:  4  2
# 5:  5  2
DT[J(unique(V1),2), nomatch=0L]
#    V1 V2
# 1:  1  2
# 2:  3  2
# 3:  5  2
DT[J(3), nomatch=0L]
#    V1 V2
# 1:  3  2

In summary:

# key(DT) = c("V1", "V2")

# data.frame                        |             data.table equivalent
# =====================================================================
# subset(DF, (V1 == 3) & (V2 == 2)) |            DT[J(3,2), nomatch=0L]
# subset(DF, (V1 == 3))             |              DT[J(3), nomatch=0L]
# subset(DF, (V2 == 2))             |  DT[J(unique(V1), 2), nomatch=0L]
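
The three rows of the summary can be checked against each other with a small sketch. DF and the same_rows helper are hypothetical names introduced here; the comparison looks at column values only and relies on the toy data already being sorted by V1 and V2, so the row order of both results agrees.

```r
library(data.table)

DF <- data.frame(V1 = c(1, 2, 3, 4, 5), V2 = c(2, 3, 2, 3, 2))
DT <- as.data.table(DF)
setkey(DT, V1, V2)

# Hypothetical helper: compare column values, ignoring class, key and row names.
same_rows <- function(dt, df) {
  identical(dt$V1, df$V1) && identical(dt$V2, df$V2)
}

both_keys  <- same_rows(DT[J(3, 2), nomatch = 0L],
                        subset(DF, (V1 == 3) & (V2 == 2)))
first_key  <- same_rows(DT[J(3), nomatch = 0L],
                        subset(DF, (V1 == 3)))
second_key <- same_rows(DT[J(unique(V1), 2), nomatch = 0L],
                        subset(DF, (V2 == 2)))

print(c(both_keys, first_key, second_key))
```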


answered Sep 30 '22 14:09

Timothée HENRY
