Using dynamic column names in `data.table`

Tags:

I want to calculate mean of each of several columns in a data.table, grouped by another column. My question is similar to two other questions on SO (one and two) but I couldn't apply those on my problem.

Here is an example:

library(data.table) dtb <- fread(input = "condition,var1,var2,var3       one,100,1000,10000       one,101,1001,10001       one,102,1002,10002       two,103,1003,10003       two,104,1004,10004       two,105,1005,10005       three,106,1006,10006       three,107,1007,10007       three,108,1008,10008       four,109,1009,10009       four,110,1010,10010")  dtb #    condition var1 var2  var3 # 1:       one  100 1000 10000 # 2:       one  101 1001 10001 # 3:       one  102 1002 10002 # 4:       two  103 1003 10003 # 5:       two  104 1004 10004 # 6:       two  105 1005 10005 # 7:     three  106 1006 10006 # 8:     three  107 1007 10007 # 9:     three  108 1008 10008 # 10:     four  109 1009 10009 # 11:     four  110 1010 10010

The calculation of each single mean is easy; e.g. for "var1": dtb[ , mean(var1), by = condition]. But I this quickly becomes cumbersome if there are many variables and you need to write all of them. Thus, dtb[, list(mean(var1), mean(var2), mean(var3)), by = condition] is undesirable. I need the column names to be dynamic and I wish to end up with something like this:

   condition  var1   var2    var3 1:       one 101.0 1001.0 10001.0 2:       two 104.0 1004.0 10004.0 3:     three 107.0 1007.0 10007.0 4:      four 109.5 1009.5 10009.5

394

asked Feb 18 '13 13:02

David D

1 Answers

you should use .SDcols (especially if you've too many columns and you require a particular operation to be performed only on a subset of the columns (apart from the grouping variable columns).

dtb[, lapply(.SD, mean), by=condition, .SDcols=2:4]  #    condition  var1   var2    var3 # 1:       one 101.0 1001.0 10001.0 # 2:       two 104.0 1004.0 10004.0 # 3:     three 107.0 1007.0 10007.0 # 4:      four 109.5 1009.5 10009.5

You could also get all the column names you'd want to take mean of first in a variable and then pass it to .SDcols like this:

keys <- setdiff(names(dtb), "condition") # keys = var1, var2, var3 dtb[, lapply(.SD, mean), by=condition, .SDcols=keys]

Edit: As Matthew Dowle rightly pointed out, since you require mean to be computed on every other column after grouping by condition, you could just do:

dtb[, lapply(.SD, mean), by=condition]

David's edit: (which got rejected): Read more about .SD from this post. I find this is relevant here. Thanks @David.

Edit 2: Suppose you have a data.table with 1000 rows and 301 columns (one column for grouping and 300 numeric columns):

require(data.table) set.seed(45) dt <- data.table(grp = sample(letters[1:15], 1000, replace=T)) m  <- matrix(rnorm(300*1000), ncol=300) dt <- cbind(dt, m) setkey(dt, "grp")

and you wanted to find the mean of the columns, say, 251:300 alone,

you can compute the mean of all the columns and then subset these columns (which is not very efficient as you'll compute on the whole data).
```
dt.out <- dt[, lapply(.SD, mean), by=grp] dim(dt.out) # 15 * 301, not efficient. 
```
you can filter the data.table first to just these columns and then compute the mean (which is again not necessarily the best solution as you have to create an extra subset'd data.table every time you want operations on certain columns.
```
dt.sub <- dt[, c(1, 251:300)] setkey(dt.sub, "grp") dt.out <- dt.sub[, lapply(.SD, mean), by=grp] 
```
you can specify each of the columns one by one as you'd normally do (but this is desirable for smaller data.tables)
```
# if you just need one or few columns dt.out <- dt[, list(m.v251 = mean(V251)), by = grp] 
```

So what's the best solution? The answer is .SDcols.

As the documentation states, for a data.table x, .SDcols specifies the columns that are included in .SD.

This basically implicitly filters the columns that will be passed to .SD instead of creating a subset (as we did before), only it is VERY efficient and FAST!

How can we do this?

By specifiying either the column numbers:

dt.out <- dt[, lapply(.SD, mean), by=grp, .SDcols = 251:300] dim(dt.out) # 15 * 51 (what we expect)

Or alternatively by specifying the column id:

ids <- paste0("V", 251:300) # get column ids dt.out <- dt[, lapply(.SD, mean), by=grp, .SDcols = ids] dim(dt.out) # 15 * 51 (what we expect)

It accepts both column names and numbers as arguments. In both these cases, .SD will be provided only with these columns we've specified.

Hope this helps.

180

answered Oct 06 '22 13:10

Arun

Related questions
                            
                                R: Insert a vector as a row in data.frame
                            
                                How should I handle 'helper' functions in an R package?
                            
                                S3 method consistency warning when building R package with Roxygen
                            
                                'Embedded nul in string' error when importing csv with fread
                            
                                doParallel error in R: Error in serialize(data, node$con) : error writing to connection
                            
                                Is it possible to modify a data.frame in-place (destructively)?
                            
                                Migrating R libraries
                            
                                How to get vector of options from server.R to ui.R for selectInput in Shiny R App
                            
                                traceback() for interactive and non-interactive R sessions
                            
                                Non character argument in R string split function (strsplit)
                            
                                R - dplyr Summarize and Retain Other Columns
                            
                                Are there any official naming conventions for R?
                            
                                dplyr mutate in R - add column as concat of columns
                            
                                Understanding .I in data.table in R
                            
                                ggplot geom_point() with colors based on specific, discrete values
                            
                                Why is this naive matrix multiplication faster than base R's?
                            
                                rm(list=ls()) doesn't completely clear the workspace
                            
                                ggplot2 : Adding two errorbars to each point in scatterplot
                            
                                R equivalent of python "_"?
                            
                                display a matrix, including the values, as a heatmap

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Using dynamic column names in `data.table`

Tags:

r

data.table

David D

People also ask

1 Answers

Arun

Recent Activity

Donate For Us