
Update multiple data.table columns elegantly [duplicate]

Tags:

r

data.table

I'm trying to do a simple thing: divide each of 40 columns of a data.table by its mean. I cannot provide the actual data (not all columns are numeric, and I have > 8M rows), but here's an example:

library(data.table)   

dt <- data.table(matrix(sample(1:100,4000,T),ncol=40))
colmeans <- colMeans(dt)

Next I thought I would do:

for (col in names(colmeans)) dt[,col:=dt[,col]/colmeans[col]]   

But this returns an error, since dt[,col] requires that column names are not quoted. Using as.name(col) doesn't cut it. Now,

res <- t(t(dt[,1:40,with=F])/colmeans)

contains the expected result, but I cannot insert it back in the data.table, as

dt[,1:40] <- res

does not work, neither does dt[,1:40:=res, with=F].

The following works, but I find it quite ugly:

for (i in seq_along(colmeans)) dt[,i:=dt[,i,with=F]/colmeans[i],with=F]

Sure, I could also recreate a new data.table by calling data.table() on res and the other non-numeric columns my data.table has, but isn't there anything more efficient?

jeanlain asked Jun 09 '16 08:06


1 Answer

We can also use set(). In this case there should be no noticeable difference from using [.data.table along with :=, but in scenarios where [.data.table would have to be called many times, set() avoids that overhead and can be noticeably faster.

for(j in names(dt)) {
 set(dt, i=NULL, j = j, value = dt[[j]]/mean(dt[[j]]))
}

It can also be done on selected columns, e.g.

nm1 <- names(dt)[1:5]
for(j in nm1){
 set(dt, i = NULL, j = j, value = dt[[j]]/mean(dt[[j]]))
}
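
For comparison, here is a minimal sketch of the [.data.table with := route mentioned above. It is not part of the original answer. It assumes the columns to rescale can be picked out with is.numeric; the label column is a hypothetical stand-in for the non-numeric columns described in the question.

```r
library(data.table)

set.seed(24)
dt <- data.table(matrix(sample(1:100, 4000, TRUE), ncol = 40))
dt[, label := sample(letters, .N, TRUE)]  # hypothetical non-numeric column

# Names of the numeric columns only (assumption: these are the ones to rescale)
cols <- names(dt)[vapply(dt, is.numeric, logical(1))]

# A single call to [.data.table: each selected column is divided by its own mean.
# Wrapping cols in parentheses makes := treat it as a vector of column names.
dt[, (cols) := lapply(.SD, function(x) x / mean(x)), .SDcols = cols]
```

Because the update happens by reference in one call, no copy of the table is made and the non-numeric columns are left untouched.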

data

set.seed(24)
dt <- as.data.frame(matrix(sample(1:100,4000,TRUE),ncol=40))
setDT(dt)
akrun answered Oct 17 '22 07:10