There are multiple ways to select columns of a data.table by using a variable holding the desired column names (with=FALSE, .., mget, ...). Is there a consensus which to use (when)? Is one more data.table-y than the others?
I could come up with the following arguments:
(1) with=FALSE and .. are almost equally fast, while mget is slower.
(2) .. can't select concatenated column names "on the fly" (EDIT: the current CRAN version 1.12.8 definitely can; I was using an old version which could not, so this argument is flawed).
(3) mget() is close to the useful syntax of get(), which seems to be the only way to use a variable name in a calculation in j.
To (1):
library(data.table)
library(microbenchmark)
a <- mtcars
setDT(a)
selected_cols <- names(a)[1:4]
microbenchmark(a[, mget(selected_cols)],
a[, selected_cols, with = FALSE],
a[, ..selected_cols],
a[, .SD, .SDcols = selected_cols])
#Unit: microseconds
# expr min lq mean median uq max neval cld
# a[, mget(selected_cols)] 468.483 495.6455 564.2953 504.0035 515.4980 4341.768 100 c
# a[, selected_cols, with = FALSE] 106.254 118.9385 141.0916 124.6670 130.1820 966.151 100 a
# a[, ..selected_cols] 112.532 123.1285 221.6683 129.9050 136.6115 2137.900 100 a
# a[, .SD, .SDcols = selected_cols] 277.536 287.6915 402.2265 293.1465 301.3990 5231.872 100 b
To (2):
b <- data.table(x = rnorm(1e6),
y = rnorm(1e6, mean = 2, sd = 4),
z = sample(LETTERS, 1e6, replace = TRUE))
selected_col <- "y"
microbenchmark(b[, mget(c("x", selected_col))],
b[, c("x", selected_col), with = FALSE],
b[, c("x", ..selected_col)])
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# b[, mget(c("x", selected_col))] 5.454126 7.160000 21.752385 7.771202 9.301334 147.2055 100 b
# b[, c("x", selected_col), with = FALSE] 2.520474 2.652773 7.764255 2.944302 4.430173 100.3247 100 a
# b[, c("x", ..selected_col)] 2.544475 2.724270 14.973681 4.038983 4.634615 218.6010 100 ab
To (3):
b[, sqrt(get(selected_col))][1:5]
# [1] NaN 1.3553462 0.7544402 1.5791845 1.1007728
b[, sqrt(..selected_col)]
# error
b[, sqrt(selected_col), with = FALSE]
# error
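For completeness (a sketch, not benchmarked above): one can also compute on a column chosen by a variable via .SD restricted with .SDcols, avoiding get()/mget(); note this returns a one-column data.table rather than a vector.
b[, lapply(.SD, sqrt), .SDcols = selected_col][1:5]
# one-column data.table with sqrt() of the selected column; first five rows shown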
EDIT: added .SDcols to the benchmark in (1), and b[, c("x", ..selected_col)] to (2).
Should I use mget(), .. or with=FALSE to select columns of a data.table?
You should use whichever you prefer, as long as it is not deprecated, of course. I don't see any realistic use case where the performance differences across the presented solutions would make a real difference.
There are some arguments for using with=FALSE over the other interfaces, but those are more about the maintenance of those interfaces than about everyday usage.
In recent data.table versions, starting from 1.14.1, there is a new feature that enables deeply parameterized data.table queries. This new interface, let's call it the "env arg", can be used to solve the problem in your question; yes, yet another way to do it. The env arg interface is much more generic, so in such a simple use case I would still use with=FALSE. Below I added verbose=TRUE to the calls using this new interface so readers can see how the queries were pre-processed to substitute the variables.
b = data.table(x = 1L, y = 2, z = "c")
selected_col = "y"
b[, c("x", selected_col), with=FALSE]
# x y
# <int> <num>
#1: 1 2
b[, .cols, env=list(.cols=I(c("x",selected_col))), verbose=T]
#Argument 'j' after substitute: c("x", "y")
# x y
# <int> <num>
#1: 1 2
b[, .cols, env=list(.cols=as.list(c("x",selected_col))), verbose=T]
#Argument 'j' after substitute: list(x, y)
# x y
# <int> <num>
#1: 1 2
Note the difference between the two forms above: I() keeps the character vector as a value, so j becomes c("x", "y") (selection by column names), while as.list() turns each name into a symbol, so j becomes list(x, y) (the usual list-in-j idiom). The new env interface also nicely supports (3):
b[, sqrt(.col), env=list(.col=selected_col), verbose=T]
#Argument 'j' after substitute: sqrt(y)
#[1] 1.414214
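A further sketch to illustrate how generic the env interface is (the placeholder names .filter, .val and .grp are my own, not data.table API): i, j and by can all be parameterized in a single call.
b[.filter > 1, mean(.val), by = .grp,
  env = list(.filter = "y", .val = "x", .grp = "z")]
# the substituted query is equivalent to b[y > 1, mean(x), by = z]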