First of all: thanks to @MattDowle; data.table is among the best things that ever happened to me since I started using R.
Second: I am aware of many workarounds for various use cases of variable column names in data.table, including:
and probably more I haven't referenced.
But: even if I learned all the tricks documented above to the point that I never had to look them up to remind myself how to use them, I still would find that working with column names that are passed as parameters to a function is an extremely tedious task.
What I'm looking for is a "best-practices-approved" alternative to the following workaround / workflow. Consider that I have a bunch of columns of similar data, and would like to perform a sequence of similar operations on these columns or sets of them, where the operations are of arbitrarily high complexity, and the groups of column names passed to each operation are specified in a variable.
I realize this issue sounds contrived, but I run into it with surprising frequency. The examples are usually so messy that it is difficult to separate out the features relevant to this question, but I recently stumbled across one that was fairly straightforward to simplify for use as a MWE here:
    library(data.table)
    library(lubridate)
    library(zoo)

    the.table <- data.table(year=1991:1996,var1=floor(runif(6,400,1400)))
    the.table[,`:=`(var2=var1/floor(runif(6,2,5)),
                    var3=var1/floor(runif(6,2,5)))]

    # Replicate data across months
    new.table <- the.table[, list(asofdate=seq(from=ymd((year)*10^4+101),
                                               length.out=12,
                                               by="1 month")),by=year]

    # Do a complicated procedure to each variable in some group.
    var.names <- c("var1","var2","var3")

    for(varname in var.names) {
        #As suggested in an answer to Link 3 above
        #Convert the column name to a 'quote' object
        quote.convert <- function(x) eval(parse(text=paste0('quote(',x,')')))

        #Do this for every column name I'll need
        varname <- quote.convert(varname)
        anntot <- quote.convert(paste0(varname,".annual.total"))
        monthly <- quote.convert(paste0(varname,".monthly"))
        rolling <- quote.convert(paste0(varname,".rolling"))
        scaled <- quote.convert(paste0(varname,".scaled"))

        #Perform the relevant tasks, using eval()
        #around every variable columnname I may want
        new.table[,eval(anntot):=
                   the.table[,rep(eval(varname),each=12)]]
        new.table[,eval(monthly):=
                   the.table[,rep(eval(varname)/12,each=12)]]
        new.table[,eval(rolling):=
                   rollapply(eval(monthly),mean,width=12,
                             fill=c(head(eval(monthly),1),
                                    tail(eval(monthly),1)))]
        new.table[,eval(scaled):=
                   eval(anntot)/sum(eval(rolling))*eval(rolling),
                  by=year]
    }
Of course, the particular effect on the data and variables here is irrelevant, so please do not focus on it or suggest improvements to accomplishing what it accomplishes in this particular case. What I am looking for, rather, is a generic strategy for the workflow of repeatedly applying an arbitrarily complicated procedure of data.table actions to a list of columns or list of lists-of-columns, specified in a variable or passed as an argument to a function, where the procedure must refer programmatically to columns named in the variable/argument, and possibly includes updates, joins, groupings, calls to the data.table special objects .I, .SD, etc.; BUT one which is simpler, more elegant, shorter, or easier to design or implement or understand than the one above or others that require frequent quote-ing and eval-ing.
In particular, please note that because the procedures can be fairly complex and involve repeatedly updating the data.table and then referencing the updated columns, the standard lapply(.SD,...), ... .SDcols = ... approach is usually not a workable substitute. Also, replacing each call of eval(a.column.name) with DT[[a.column.name]] neither simplifies much nor works completely in general, since that doesn't play nice with the other data.table operations, as far as I am aware.
To select a column in R you can use brackets, e.g., YourDataFrame['Column'] will take the column named "Column". Furthermore, we can also use dplyr and the select() function to get columns by name or index. For instance, select(YourDataFrame, c('A', 'B')) will take the columns named "A" and "B" from the dataframe.
To access a specific column in a dataframe by name, you use the $ operator in the form df$name where df is the name of the dataframe, and name is the name of the column you are interested in. This operation will then return the column you want as a vector.
The problem you are describing is not strictly related to data.table.
Complex queries cannot be easily translated to code that a machine can parse, so we cannot escape complexity when writing a query for complex operations.
You can try to imagine how to programmatically construct a query for the following data.table query using dplyr or SQL:
    DT[, c(f1(v1, v2, opt=TRUE),
           f2(v3, v4, v5, opt1=FALSE, opt2=TRUE),
           lapply(.SD, f3, opt1=TRUE, opt2=FALSE))
       , by=.(id1, id2)]
Assuming that all columns (id1, id2, v1...v5) or even options (opt, opt1, opt2) should be passed as variables.
Because of the complexity in expressing queries, I don't think you could easily accomplish the requirement stated in your question:
is simpler, more elegant, shorter, or easier to design or implement or understand than the one above or others that require frequent quote-ing and eval-ing.
That said, compared to other programming languages, base R provides very useful tools to deal with such problems.
You already found suggestions to use get, mget, DT[[col_name]], parse, quote, eval.
- DT[[col_name]] might not play well with data.table optimizations, and thus is not that useful here.
- parse is probably the easiest way to construct complex queries, as you can just operate on strings, but it doesn't provide basic language syntax validation, so you can end up trying to parse a string that the R parser does not accept. Additionally, there is a security concern, as presented in 2655#issuecomment-376781159.
- get/mget are the ones most commonly suggested to deal with such problems. get and mget are caught internally by [.data.table and translated to the expected columns. So you are assuming that your arbitrarily complex query can be decomposed by [.data.table and the expected columns properly inputted.
- The dot-dot prefix works similarly to get: variables having the prefix will be de-referenced inside of [.data.table. In future releases the dot-dot prefix may allow calls like:

    col1="a"; col2="b"; col3="g"; col4="x"; col5="y"
    DT[..col4==..col5, .(s1=sum(..col1), s2=sum(..col2)), by=..col3]
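For comparison, here is a minimal sketch of how get, mget, and the dot-dot prefix are commonly used today. The table DT and its columns id, a, and b are made up for illustration, and the dot-dot prefix is only used in j to select columns, which is the usage I am sure is supported; the more general calls shown above remain a possible future feature.

```r
library(data.table)

# Hypothetical toy table; column names are assumptions for illustration only
DT <- data.table(id = c(1L, 1L, 2L), a = c(10, 20, 30), b = c(1, 2, 3))

col  <- "a"          # a single column name held in a variable
cols <- c("a", "b")  # several column names held in a variable

# get(): look up one column by the name stored in `col`
s1 <- DT[, sum(get(col))]

# mget(): look up several columns at once, returned as a list
s2 <- DT[, lapply(mget(cols), sum)]

# dot-dot prefix: `..cols` refers to the variable `cols` in the calling scope
sub <- DT[, ..cols]
```

All three rely on [.data.table recognising the construct, which is the dependency described in the list above.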
Personally, I prefer quote and eval instead. An expression built with quote and eval is interpreted almost as if it were written by hand from scratch. This method does not rely on data.table's ability to manage references to columns, so we can expect all optimizations to work the same way as if you wrote those queries by hand. I also find it easier to debug, as at any point you can just print the quoted expression to see what is actually passed to the data.table query. Additionally, there is less room for bugs to occur. Constructing complex queries using R language objects is sometimes tricky, but it is easy to wrap the procedure into a function so it can be applied in different use cases and easily re-used. It is important to note that this method is independent of data.table: it uses R language constructs. You can find more information about that in the official R Language Definition, in the Computing on the language chapter.
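As a small illustration of that idea, the sketch below builds a whole data.table call with base R substitute, prints it for inspection, and then evaluates it. The table DT, its columns g and x, and the placeholder symbols .v and .g are assumptions made up for this example.

```r
library(data.table)

# Hypothetical toy table for illustration
DT <- data.table(g = c("a", "a", "b"), x = 1:3)

val_col <- "x"  # column to aggregate, passed as a string
grp_col <- "g"  # grouping column, passed as a string

# Build the call as a language object: the placeholders .v and .g are
# replaced by symbols created from the strings above
q <- substitute(
  DT[, .(total = sum(.v)), by = .g],
  list(.v = as.name(val_col), .g = as.name(grp_col))
)

# Inspect the query before running it; it reads like hand-written code
print(q)

# Evaluate it exactly as if it had been typed by hand
res <- eval(q)
```

Because the evaluated call contains plain column symbols, data.table sees the same query it would see if you had typed it directly.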
What else?
- ... DT[eval(qi), eval(qj), eval(qby)], so you still have to operate on R language objects. You are welcome to put your comment there.
- PR#4304 brings base R substitute functionality into [.data.table using a new argument, env.

Going to the example. Below I will show two ways to solve it. The first one will use base R metaprogramming; the second one will use the metaprogramming for data.table proposed in PR#4304 (see above).
I will wrap all the logic into a do_vars function. Calling do_vars with donot=TRUE will return the expressions to be computed on data.table instead of evaluating them. The code below should be run just after the OP code.
    expected = copy(new.table)
    new.table = the.table[, list(asofdate=seq(from=ymd((year)*10^4+101), length.out=12, by="1 month")), by=year]

    do_vars = function(x, y, vars, donot=FALSE) {
      name.suffix = function(x, suffix) as.name(paste(x, suffix, sep="."))
      do_var = function(var, x, y) {
        substitute({
          x[, .anntot := y[, rep(.var, each=12)]]
          x[, .monthly := y[, rep(.var/12, each=12)]]
          x[, .rolling := rollapply(.monthly, mean, width=12, fill=c(head(.monthly,1), tail(.monthly,1)))]
          x[, .scaled := .anntot/sum(.rolling)*.rolling, by=year]
        }, list(
          .var=as.name(var),
          .anntot=name.suffix(var, "annual.total"),
          .monthly=name.suffix(var, "monthly"),
          .rolling=name.suffix(var, "rolling"),
          .scaled=name.suffix(var, "scaled")
        ))
      }
      ql = lapply(setNames(nm=vars), do_var, x, y)
      if (donot) return(ql)
      lapply(ql, eval.parent)
      invisible(x)
    }

    do_vars(new.table, the.table, c("var1","var2","var3"))
    all.equal(expected, new.table)
    #[1] TRUE
We can preview the queries:
    do_vars(new.table, the.table, c("var1","var2","var3"), donot=TRUE)
    #$var1
    #{
    #    x[, `:=`(var1.annual.total, y[, rep(var1, each = 12)])]
    #    x[, `:=`(var1.monthly, y[, rep(var1/12, each = 12)])]
    #    x[, `:=`(var1.rolling, rollapply(var1.monthly, mean, width = 12,
    #        fill = c(head(var1.monthly, 1), tail(var1.monthly, 1))))]
    #    x[, `:=`(var1.scaled, var1.annual.total/sum(var1.rolling) *
    #        var1.rolling), by = year]
    #}
    #
    #$var2
    #{
    #    x[, `:=`(var2.annual.total, y[, rep(var2, each = 12)])]
    #    x[, `:=`(var2.monthly, y[, rep(var2/12, each = 12)])]
    #    x[, `:=`(var2.rolling, rollapply(var2.monthly, mean, width = 12,
    #        fill = c(head(var2.monthly, 1), tail(var2.monthly, 1))))]
    #    x[, `:=`(var2.scaled, var2.annual.total/sum(var2.rolling) *
    #        var2.rolling), by = year]
    #}
    #
    #$var3
    #{
    #    x[, `:=`(var3.annual.total, y[, rep(var3, each = 12)])]
    #    x[, `:=`(var3.monthly, y[, rep(var3/12, each = 12)])]
    #    x[, `:=`(var3.rolling, rollapply(var3.monthly, mean, width = 12,
    #        fill = c(head(var3.monthly, 1), tail(var3.monthly, 1))))]
    #    x[, `:=`(var3.scaled, var3.annual.total/sum(var3.rolling) *
    #        var3.rolling), by = year]
    #}
    #
    expected = copy(new.table)
    new.table = the.table[, list(asofdate=seq(from=ymd((year)*10^4+101), length.out=12, by="1 month")), by=year]

    name.suffix = function(x, suffix) as.name(paste(x, suffix, sep="."))
    do_var2 = function(var, x, y) {
      x[, .anntot := y[, rep(.var, each=12)],
        env = list(
          .anntot = name.suffix(var, "annual.total"),
          .var = var
        )]
      x[, .monthly := y[, rep(.var/12, each=12)],
        env = list(
          .monthly = name.suffix(var, "monthly"),
          .var = var
        )]
      x[, .rolling := rollapply(.monthly, mean, width=12, fill=c(head(.monthly,1), tail(.monthly,1))),
        env = list(
          .rolling = name.suffix(var, "rolling"),
          .monthly = name.suffix(var, "monthly")
        )]
      x[, .scaled := .anntot/sum(.rolling)*.rolling, by=year,
        env = list(
          .scaled = name.suffix(var, "scaled"),
          .anntot = name.suffix(var, "annual.total"),
          .rolling = name.suffix(var, "rolling")
        )]
      TRUE
    }

    sapply(setNames(nm=var.names), do_var2, new.table, the.table)
    #var1 var2 var3
    #TRUE TRUE TRUE
    all.equal(expected, new.table)
    #[1] TRUE
Data and updated OP code
    library(data.table)
    library(lubridate)
    library(zoo)

    the.table <- data.table(year=1991:1996,var1=floor(runif(6,400,1400)))
    the.table[,`:=`(var2=var1/floor(runif(6,2,5)),
                    var3=var1/floor(runif(6,2,5)))]

    # Replicate data across months
    new.table <- the.table[, list(asofdate=seq(from=ymd((year)*10^4+101),
                                               length.out=12,
                                               by="1 month")),by=year]

    # Do a complicated procedure to each variable in some group.
    var.names <- c("var1","var2","var3")

    for(varname in var.names) {
        #As suggested in an answer to Link 3 above
        #Convert the column name to a 'quote' object
        quote.convert <- function(x) eval(parse(text=paste0('quote(',x,')')))

        #Do this for every column name I'll need
        varname <- quote.convert(varname)
        anntot <- quote.convert(paste0(varname,".annual.total"))
        monthly <- quote.convert(paste0(varname,".monthly"))
        rolling <- quote.convert(paste0(varname,".rolling"))
        scaled <- quote.convert(paste0(varname,".scaled"))

        #Perform the relevant tasks, using eval()
        #around every variable columnname I may want
        new.table[,paste0(varname,".annual.total"):=
                   the.table[,rep(eval(varname),each=12)]]
        new.table[,paste0(varname,".monthly"):=
                   the.table[,rep(eval(varname)/12,each=12)]]
        new.table[,paste0(varname,".rolling"):=
                   rollapply(eval(monthly),mean,width=12,
                             fill=c(head(eval(monthly),1),
                                    tail(eval(monthly),1)))]
        new.table[,paste0(varname,".scaled"):=
                   eval(anntot)/sum(eval(rolling))*eval(rolling),
                  by=year]
    }
Thanks for the question. Your original approach goes a long way towards solving most of the issues.
Here I've tweaked the quoting function slightly, and changed the approach to parse and evaluate the entire RHS expression as a string instead of the individual variables.
The reasoning being:
    library(data.table)
    library(lubridate)
    library(zoo)

    set.seed(1)
    the.table <- data.table(year=1991:1996,var1=floor(runif(6,400,1400)))
    the.table[,`:=`(var2=var1/floor(runif(6,2,5)),
                    var3=var1/floor(runif(6,2,5)))]

    # Replicate data across months
    new.table <- the.table[, list(asofdate=seq(from=ymd((year)*10^4+101),
                                               length.out=12,
                                               by="1 month")),by=year]

    # function to paste, parse & evaluate arguments
    evalp <- function(..., envir=parent.frame()) {eval(parse(text=paste0(...)), envir=envir)}

    # Do a complicated procedure to each variable in some group.
    var.names <- c("var1","var2","var3")

    for(varname in var.names) {

      # 1. For LHS, use paste0 to generate new column name as string (from @eddi's comment)
      # 2. For RHS, use evalp
      new.table[, paste0(varname, '.annual.total') := evalp(
        'the.table[,rep(', varname, ',each=12)]'
      )]

      new.table[, paste0(varname, '.monthly') := evalp(
        'the.table[,rep(', varname, '/12,each=12)]'
      )]

      # Need to add envir=.SD when working within the table
      new.table[, paste0(varname, '.rolling') := evalp(
        'rollapply(',varname, '.monthly,mean,width=12, fill=c(head(', varname, '.monthly,1), tail(', varname, '.monthly,1)))'
        , envir=.SD
      )]

      new.table[,paste0(varname, '.scaled'):= evalp(
          varname, '.annual.total / sum(', varname, '.rolling) * ', varname, '.rolling'
          , envir=.SD
        )
        ,by=year
      ]

      # Since we're working with strings, more freedom
      # to work programmatically
      new.table[, paste0(varname, '.row.percent') := evalp(
        'the.table[,rep(', varname, '/ (', paste(var.names, collapse='+'), '), each=12)]'
      )]
    }