How to pass "everything possible" to by in a function?

I am trying to use data.table within a user-facing function in a package I'm working on, and I would like this function to behave as much like data.table as possible. In particular, my function has a by argument that is passed to the underlying data.table call inside the function. The user should be able to pass anything to "my" by that is valid as by in a direct data.table call.

According to ?data.table, this includes:

  1. A single unquoted column name: e.g., DT[, .(sa=sum(a)), by=x]
  2. a list() of expressions of column names: e.g., DT[, .(sa=sum(a)), by=.(x=x>0, y)]
  3. a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end): e.g., DT[, sum(a), by="x,y,z"]
  4. a character vector of column names: e.g., DT[, sum(a), by=c("x", "y")]
  5. or of the form startcol:endcol: e.g., DT[, sum(a), by=x:z]

Here is a minimal (partially) working example to make my intent clear:

library(data.table)
#> Warning: package 'data.table' was built under R version 3.6.2
sample_dt <- data.table(a = 1:5, b = 5:1)

count_by <- function(dt, by = NULL) {
    by <- substitute(by)
    dt[, .N, by = eval(by, dt, parent.frame())]
}

count_by(sample_dt)               
#>    N
#> 1: 5
count_by(sample_dt, by = a)       # refers to 1 from the list above
#>    by N
#> 1:  1 1
#> 2:  2 1
#> 3:  3 1
#> 4:  4 1
#> 5:  5 1
count_by(sample_dt, by = list(a)) # refers to 2 from the list above
#>    a N
#> 1: 1 1
#> 2: 2 1
#> 3: 3 1
#> 4: 4 1
#> 5: 5 1
count_by(sample_dt, by = "a")     # refers to 3 from the list above
#>    a N
#> 1: 1 1
#> 2: 2 1
#> 3: 3 1
#> 4: 4 1
#> 5: 5 1
count_by(sample_dt, by = c("a"))  # refers to 4 from the list above
#> Error in `[.data.table`(dt, , .N, by = eval(by, dt, parent.frame())): 'by' appears to evaluate to column names but isn't c() or key(). Use by=list(...) if you can. Otherwise, by=evalc("a") should work. This is for efficiency so data.table can detect which columns are needed.
count_by(sample_dt, by = a:b)     # refers to 5 from the list above
#>    a b N
#> 1: 1 5 1
#> 2: 2 4 1
#> 3: 3 3 1
#> 4: 4 2 1
#> 5: 5 1 1

Created on 2020-02-18 by the reprex package (v0.3.0)

Apart from case 4, everything works as expected using simple substitution and evaluation in the proper context. So my question is:

How can I write functions that use data.table internally and mimic the original by interface exactly?


Session info

devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.6.1 (2019-07-05)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  German_Germany.1252         
#>  ctype    German_Germany.1252         
#>  tz       Europe/Berlin               
#>  date     2020-02-18                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date       lib source        
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.2)
#>  backports     1.1.5   2019-10-02 [1] CRAN (R 3.6.1)
#>  callr         3.4.1   2020-01-24 [1] CRAN (R 3.6.2)
#>  cli           2.0.1   2020-01-08 [1] CRAN (R 3.6.2)
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.2)
#>  data.table  * 1.12.8  2019-12-09 [1] CRAN (R 3.6.2)
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.2)
#>  devtools      2.2.1   2019-09-24 [1] CRAN (R 3.6.2)
#>  digest        0.6.23  2019-11-23 [1] CRAN (R 3.6.2)
#>  ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.6.2)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.2)
#>  fansi         0.4.1   2020-01-08 [1] CRAN (R 3.6.2)
#>  fs            1.3.1   2019-05-06 [1] CRAN (R 3.6.2)
#>  glue          1.3.1   2019-03-12 [1] CRAN (R 3.6.2)
#>  highr         0.8     2019-03-20 [1] CRAN (R 3.6.2)
#>  htmltools     0.4.0   2019-10-04 [1] CRAN (R 3.6.2)
#>  knitr         1.27    2020-01-16 [1] CRAN (R 3.6.2)
#>  magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.2)
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.6.2)
#>  pkgbuild      1.0.6   2019-10-09 [1] CRAN (R 3.6.2)
#>  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.2)
#>  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 3.6.2)
#>  processx      3.4.1   2019-07-18 [1] CRAN (R 3.6.2)
#>  ps            1.3.0   2018-12-21 [1] CRAN (R 3.6.2)
#>  R6            2.4.1   2019-11-12 [1] CRAN (R 3.6.2)
#>  Rcpp          1.0.3   2019-11-08 [1] CRAN (R 3.6.2)
#>  remotes       2.1.0   2019-06-24 [1] CRAN (R 3.6.2)
#>  rlang         0.4.4   2020-01-28 [1] CRAN (R 3.6.2)
#>  rmarkdown     2.1     2020-01-20 [1] CRAN (R 3.6.2)
#>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.6.2)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.2)
#>  stringi       1.4.4   2020-01-09 [1] CRAN (R 3.6.2)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.6.2)
#>  testthat      2.3.1   2019-12-01 [1] CRAN (R 3.6.2)
#>  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.6.2)
#>  withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.2)
#>  xfun          0.12    2020-01-13 [1] CRAN (R 3.6.2)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 3.6.2)
#> 
#> [1] C:/Program Files/R/library

der_grund asked Feb 18 '20 11:02


1 Answer

Is there a particular reason for using eval inside the data.table call? I think it would be better to substitute the arguments into the whole call and evaluate that instead:

count_by <- function(dt, by = NULL) {
  eval(substitute(dt[, .N, by = by]))
}

It passes all of your test cases, including case 1, where your function returns the grouping column under the name by instead of a.
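
As a quick check, the function can be run against the five forms from the question. This is only a sketch: it reuses sample_dt from the question, the res_* names are made up for illustration, and it assumes the calls are made at top level, as in the reprex.

```r
library(data.table)

sample_dt <- data.table(a = 1:5, b = 5:1)

# Function from the answer, unchanged: substitute() splices the caller's
# expressions for dt and by into the call, and eval() then runs that call.
count_by <- function(dt, by = NULL) {
  eval(substitute(dt[, .N, by = by]))
}

res_none  <- count_by(sample_dt)                # no grouping
res_name  <- count_by(sample_dt, by = a)        # case 1: unquoted column name
res_list  <- count_by(sample_dt, by = list(a))  # case 2: list() of expressions
res_str   <- count_by(sample_dt, by = "a")      # case 3: single character string
res_chr   <- count_by(sample_dt, by = c("a"))   # case 4: character vector
res_range <- count_by(sample_dt, by = a:b)      # case 5: startcol:endcol

# In case 1 the grouping column should keep its own name rather than "by".
names(res_name)
```

Because the call is built from the caller's own expressions, data.table sees by = a, by = c("a") and so on exactly as if they had been typed directly inside [.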

Roland answered Nov 15 '22 05:11