Working with temporary columns (created on-the-fly) more efficiently in a dataframe

Consider the following dataframe:

df <- data.frame(replicate(5,sample(1:10, 10, rep=TRUE)))

If I want to divide each row by its sum (to make a probability distribution), I need to do something like this:

df %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs)

This really feels inefficient:

  1. Create an rs column
  2. Divide each value by the sum of its row
  3. Remove the temporarily created column to clean up the original dataframe.
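
For reference, here is a minimal base R sketch of those three steps, with a check that every row ends up summing to 1. It is only an illustration, not the dplyr pipeline itself: the fixed seed and the name `tmp` are assumptions added here and do not appear in the original code.

```r
set.seed(42)  # assumption: seed added so the example is reproducible
df <- data.frame(replicate(5, sample(1:10, 10, rep = TRUE)))

# Step 1: create the temporary column of row sums
tmp <- df                      # `tmp` is a hypothetical working copy
tmp$rs <- rowSums(df)

# Step 2: divide every original column by the row sum
value_cols <- setdiff(names(tmp), "rs")
tmp[value_cols] <- lapply(tmp[value_cols], function(col) col / tmp$rs)

# Step 3: drop the temporary column again
tmp$rs <- NULL

# Sanity check: each row should now sum to 1 (up to floating-point tolerance)
row_totals <- unname(rowSums(tmp))
stopifnot(isTRUE(all.equal(row_totals, rep(1, nrow(tmp)))))
```

The helper column exists only between steps 1 and 3, which is exactly the add-then-remove pattern the question is about.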

When working with existing columns, it feels much more natural:

df %>% summarise_each(funs(weighted.mean(., X1)), -X1)

Using dplyr, is there a better way to work with temporary columns (created on-the-fly) than adding them and then removing them after processing?

I'm also interested in how data.table would handle such a task.

Steven Beaupré asked Mar 18 '23 04:03



1 Answer

As I mentioned in a comment above, I don't think it makes sense to keep this data in either a data.frame or a data.table. If you must, though, the following does it without converting to a matrix, and it illustrates how to create a temporary variable inside the data.table j-expression:

library(data.table)
dt = as.data.table(df)

dt[, names(dt) := {sums = Reduce(`+`, .SD); lapply(.SD, '/', sums)}]
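
To see what the expression inside the braces computes, the same two calls can be run on a plain data.frame, which, like `.SD`, is a list of columns. This is only a sketch for illustration: the seed and the names `sums` (outside of `j`) and `normalized` are assumptions chosen here, and nothing is assigned back by reference as `:=` would do.

```r
set.seed(1)  # assumption: seed added so the example is reproducible
df <- data.frame(replicate(5, sample(1:10, 10, rep = TRUE)))

# Reduce(`+`, ...) adds the columns element-wise, giving one total per row
sums <- Reduce(`+`, df)

# lapply() passes `sums` on as the second argument of `/`,
# so each column is divided by the vector of row totals
normalized <- as.data.frame(lapply(df, `/`, sums))

# The totals agree with rowSums(), and every normalized row adds up to 1
stopifnot(all(sums == rowSums(df)))
stopifnot(isTRUE(all.equal(unname(rowSums(normalized)), rep(1, nrow(df)))))
```

In the data.table version, `sums` lives only for the duration of the `j` call, so no temporary column is ever added to the table.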
eddi answered Apr 06 '23 01:04
