I'm trying to do a simple thing: divide 40 columns of a data.table by their means. I cannot provide the actual data (not all columns are numeric, and I have more than 8M rows), but here's an example:
library(data.table)
dt <- data.table(matrix(sample(1:100,4000,T),ncol=40))
colmeans <- colMeans(dt)
Next I thought I would do:
for (col in names(colmeans)) dt[,col:=dt[,col]/colmeans[col]]
But this returns an error, since dt[,col] requires that column names are not quoted. Using as.name(col) doesn't cut it.
Now,
res <- t(t(dt[,1:40,with=F])/colmeans)
contains the expected result, but I cannot insert it back into the data.table, as
dt[,1:40] <- res
does not work, and neither does dt[,1:40:=res, with=F].
The following works, but I find it quite ugly:
for (i in seq_along(colmeans)) dt[,i:=dt[,i,with=F]/colmeans[i],with=F]
Sure, I could also recreate a new data.table by calling data.table() on res and the other non-numeric columns my data.table has, but isn't there anything more efficient?
We can also use set(). In this case, there should be no noticeable difference compared with using [.data.table along with :=, but in scenarios where [.data.table has to be called multiple times, using set() helps avoid that overhead and could be noticeably faster.
for(j in names(dt)) {
set(dt, i=NULL, j = j, value = dt[[j]]/mean(dt[[j]]))
}
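For comparison, the [.data.table plus := form mentioned above could look roughly like the sketch below. It is only an illustration: dt2 and cols are names made up here, and the seed is arbitrary. A single call with .SDcols updates all the chosen columns by reference.

```r
library(data.table)

set.seed(24)  # arbitrary seed, only for reproducibility
# dt2 is a hypothetical copy of the example data, so dt itself is left alone
dt2 <- data.table(matrix(sample(1:100, 4000, TRUE), ncol = 40))

cols <- names(dt2)
# Wrapping cols in parentheses makes := treat it as a character vector of names
dt2[, (cols) := lapply(.SD, function(x) x / mean(x)), .SDcols = cols]

# Each scaled column should now have a mean of 1 (up to floating-point error)
check <- all.equal(unname(colMeans(dt2)), rep(1, length(cols)))
```

Here [.data.table is called only once, so the overhead that set() avoids inside a loop should not matter much in this form.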
It can also be done on selected columns, e.g.
nm1 <- names(dt)[1:5]
for(j in nm1){
set(dt, i = NULL, j = j, value = dt[[j]]/mean(dt[[j]]))
}
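Since the real data in the question mixes numeric and non-numeric columns, the selection could also be derived from the column types rather than from positions. The following is a minimal sketch under that assumption; dt3, label and num_cols are hypothetical names introduced only for this example.

```r
library(data.table)

set.seed(24)  # arbitrary seed, only for reproducibility
# Small hypothetical table: four numeric columns plus one character column
dt3 <- data.table(matrix(sample(1:100, 400, TRUE), ncol = 4))
dt3[, label := sample(letters, .N, TRUE)]

# Keep only the names of the numeric columns
num_cols <- names(dt3)[vapply(dt3, is.numeric, logical(1))]

# Same set() loop as above, restricted to the numeric columns
for (j in num_cols) {
  set(dt3, i = NULL, j = j, value = dt3[[j]] / mean(dt3[[j]]))
}

# Means of the scaled columns, for a quick sanity check
scaled_means <- vapply(num_cols, function(j) mean(dt3[[j]]), numeric(1))
```

The character column is never touched, so nothing needs to be rebuilt or rebound afterwards.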
set.seed(24)
dt <- as.data.frame(matrix(sample(1:100,4000,TRUE),ncol=40))
setDT(dt)