While assessing the utility of data.table (vs. dplyr), a critical factor is the ability to use it within functions and loops.
To explore this, I've modified the code snippet used in this post: data.table vs dplyr: can one do something well the other can't or does poorly? so that, instead of hard-coded variable names (the "cut" and "price" columns of the "diamonds" dataset), it becomes dataset-agnostic: ready to cut-and-paste into any function or loop where the column names are not known in advance.
This is the original code:
library(data.table)
dt <- data.table(ggplot2::diamonds)
dt[cut != "Fair", .(mean(price),.N), by = cut]
This is its dataset-agnostic equivalent:
dt <- data.table(diamonds)
nVarGroup <- 2 #"cut"
nVarMeans <- 7 #"price"
strGroupConditions <- levels(dt[[nVarGroup]])[-1] # "Good" "Very Good" "Premium" "Ideal"
strVarGroup <- names(dt)[nVarGroup]
strVarMeans <- names(dt)[nVarMeans]
qAction <- quote(mean(get(strVarMeans))) #! w/o get() it does not work!
qGroup <- quote(get(strVarGroup) %in% strGroupConditions) #! w/o get() it does not work!
dt[eval(qGroup), .(eval(qAction), .N), by = strVarGroup]
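As an alternative to the quote()/eval() approach, newer data.table releases (1.14.3 and later) provide an env argument that substitutes character values for names in the query; a minimal sketch, assuming a recent enough version:

```r
library(data.table)  # env= requires data.table >= 1.14.3
dt <- data.table(ggplot2::diamonds)

# The placeholders grp and val are replaced by the character values
# given in env=, so the query reads like the hard-coded original
dt[grp != "Fair", .(mean(val), .N), by = grp,
   env = list(grp = "cut", val = "price")]
```

This avoids get() entirely; see the "programming on data.table" vignette for details.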
Note (thanks to the reply below): if you need to change a variable's value by reference, you need to use (), not get(), as shown below:
strVarToBeReplaced <- names(dt)[1]
dt[eval(qGroup), (strVarToBeReplaced) := eval(qAction), by = strVarGroup][]
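To illustrate the () idiom in isolation, here is a small self-contained sketch (toy table and column names are my own, purely for illustration):

```r
library(data.table)
dt2 <- data.table(a = 1:3, b = 4:6)
colToReplace <- "a"

# Wrapping the character variable in () makes := write to the column
# named by its *value* ("a"); without the parentheses, a new column
# literally named "colToReplace" would be created instead.
dt2[, (colToReplace) := b * 10L]
dt2
```

After this, column a holds 40, 50, 60 while b is unchanged.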
Now: you can cut-n-paste the following code for all your looping needs:
for (nVarGroup in 2:4) {      # grouped by each categorical variable ...
  for (nVarMeans in 5:10) {   # ... get means of all numerical variables
    strGroupConditions <- levels(dt[[nVarGroup]])[-1]
    strVarGroup <- names(dt)[nVarGroup]
    strVarMeans <- names(dt)[nVarMeans]
    qAction <- quote(mean(get(strVarMeans)))
    qGroup <- quote(get(strVarGroup) %in% strGroupConditions)

    p <- dt[eval(qGroup), .(AVE = eval(qAction), COUNT = .N), by = strVarGroup]
    print(sprintf("nVarGroup=%s, nVarMeans=%s: ", strVarGroup, strVarMeans))
    print(p)
  }
}
My first question:
The code above, while enabling the required functional/looping needs, appears quite convoluted: it relies on several (possibly inconsistent) non-intuitive tricks, such as a combination of (), get(), quote()/eval(), and [[]]. That seems like a lot for such a straightforward need...
Is there a better way of accessing and modifying data.table values in loops? Perhaps with on=, or lapply/.SD/.SDcols?
Please share your ideas below. This discussion aims to supplement and consolidate related bits from other posts (such as those listed here: How can one work fully generically in data.table in R with column names in variables). Eventually, it would be great to create a dedicated vignette for using data.table within functions and loops.
The second question:
Is dplyr easier for this purpose? For this question, however, I've set up a separate post: Is dplyr easier than data.table to be used within functions and loops?
The OP has asked for a dataset-agnostic equivalent for grouping and aggregating.
With development version 1.10.5, data.table has gained new Grouping Sets functions: rollup(), cube(), and groupingsets(), which allow aggregation on various grouping levels at once, producing sub-totals and a grand total.
The added level of abstraction can be used for a dataset-agnostic approach. The subtotals which are computed using a double nested for loop in the OP's example can be produced as well by
library(data.table)   # version 1.10.5 required
dt <- data.table(ggplot2::diamonds)
groupingsets(dt, c(lapply(.SD, mean), list(COUNT = .N)),
             by = names(dt)[2:4], .SDcols = 5:10, id = FALSE,
             sets = as.list(names(dt)[2:4]))
          cut color clarity    depth    table    price        x        y        z COUNT
 1:     Ideal    NA      NA 61.70940 55.95167 3457.542 5.507451 5.520080 3.401448 21551
 2:   Premium    NA      NA 61.26467 58.74610 4584.258 5.973887 5.944879 3.647124 13791
 3:      Good    NA      NA 62.36588 58.69464 3928.864 5.838785 5.850744 3.639507  4906
 4: Very Good    NA      NA 61.81828 57.95615 3981.760 5.740696 5.770026 3.559801 12082
 5:      Fair    NA      NA 64.04168 59.05379 4358.758 6.246894 6.182652 3.982770  1610
 6:        NA     E      NA 61.66209 57.49120 3076.752 5.411580 5.419029 3.340689  9797
 7:        NA     I      NA 61.84639 57.57728 5091.875 6.222826 6.222730 3.845411  5422
 8:        NA     J      NA 61.88722 57.81239 5323.818 6.519338 6.518105 4.033251  2808
 9:        NA     H      NA 61.83685 57.51781 4486.669 5.983335 5.984815 3.695965  8304
10:        NA     F      NA 61.69458 57.43354 3724.886 5.614961 5.619456 3.464446  9542
11:        NA     G      NA 61.75711 57.28863 3999.136 5.677543 5.680192 3.505021 11292
12:        NA     D      NA 61.69813 57.40459 3169.954 5.417051 5.421128 3.342827  6775
13:        NA    NA     SI2 61.77217 57.92718 5063.029 6.401370 6.397826 3.948478  9194
14:        NA    NA     SI1 61.85304 57.66254 3996.001 5.888383 5.888256 3.639845 13065
15:        NA    NA     VS1 61.66746 57.31515 3839.455 5.572178 5.581828 3.441007  8171
16:        NA    NA     VS2 61.72442 57.41740 3924.989 5.657709 5.658859 3.491478 12258
17:        NA    NA    VVS2 61.66378 57.02499 3283.737 5.218454 5.232118 3.221465  5066
18:        NA    NA    VVS1 61.62465 56.88446 2523.115 4.960364 4.975075 3.061294  3655
19:        NA    NA      I1 62.73428 58.30378 3924.169 6.761093 6.709379 4.207908   741
20:        NA    NA      IF 61.51061 56.50721 2864.839 4.968402 4.989827 3.061659  1790
So, we do not have to know the names of the columns. However, we have to specify which columns to group by and which columns to aggregate.
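If even the column positions are unknown, the grouping and aggregation columns can be picked by type; a sketch (my own variation on the call above):

```r
library(data.table)  # groupingsets() requires version 1.10.5+
dt <- data.table(ggplot2::diamonds)

# Select factor columns for grouping and numeric columns to average,
# without hard-coding any positions or names
groupCols <- names(dt)[sapply(dt, is.factor)]   # cut, color, clarity
valueCols <- names(dt)[sapply(dt, is.numeric)]  # carat, depth, ..., z

groupingsets(dt, c(lapply(.SD, mean), list(COUNT = .N)),
             by = groupCols, .SDcols = valueCols,
             sets = as.list(groupCols), id = FALSE)
```

Note this also averages carat, which the position-based call above excluded.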