Using a data.table, what would be the fastest way to "sweep" out a statistic across a selection of columns?
Starting with (a considerably larger version of) DT:
p <- 3
DT <- data.table(id=c("A","B","C"),x1=c(10,20,30),x2=c(20,30,10))
DT.totals <- DT[, list(id,total = x1+x2) ]
I'd like to get to the following data.table result by indexing the target columns (2:p) in order to skip the key:
id x1 x2
[1,] A 0.33 0.67
[2,] B 0.40 0.60
[3,] C 0.75 0.25
I believe that something close to the following (which uses the relatively new set() function) will be quickest:
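library(data.table)
DT <- data.table(id = c("A","B","C"), x1 = c(10,20,30), x2 = c(20,30,10))
total <- DT[ , x1 + x2]                        # row totals, computed once before any column is overwritten
rr <- seq_len(nrow(DT))                        # all row indices
for(j in 2:3) set(DT, rr, j, DT[[j]]/total)    # divide columns 2 and 3 by the totals, in place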
DT
# id x1 x2
# [1,] A 0.3333333 0.6666667
# [2,] B 0.4000000 0.6000000
# [3,] C 0.7500000 0.2500000
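Since the question asks for the target columns to be addressed by index (2:p) rather than by name, the same loop can be written without hard-coding x1 and x2. The following is only a sketch under that assumption: the helper variable cols is introduced here for illustration, and rowSums() over .SD is one of several ways to get the totals, not necessarily the fastest on very wide tables.

```r
library(data.table)

DT <- data.table(id = c("A", "B", "C"), x1 = c(10, 20, 30), x2 = c(20, 30, 10))
p <- 3
cols <- 2:p   # target columns by position, skipping the id column

# Row totals across the target columns, taken before any column is overwritten
total <- DT[, rowSums(.SD), .SDcols = cols]

# Replace each target column, by reference, with its share of the row total
for (j in cols) set(DT, i = NULL, j = j, value = DT[[j]] / total)

DT
```

Passing i = NULL addresses all rows, so the explicit seq_len(nrow(DT)) vector is not needed here.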
FWIW, calls to set() take the following form:
# set(x, i, j, value), where:
# x is a data.table
# i contains row indices
# j contains column indices
# value is the value to be assigned into the specified cells
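As a small illustration of those four arguments, here is a sketch that assigns into a single cell. The table and the value 99 are made up for the example; note that nothing is assigned back to DT, because set() modifies it by reference.

```r
library(data.table)

DT <- data.table(id = c("A", "B", "C"), x1 = c(10, 20, 30))

# x = DT, i = row 2, j = column "x1" (a column number would also work), value = 99
set(DT, i = 2L, j = "x1", value = 99)

DT$x1
```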
My suspicion about the relative speed of this, compared to other solutions, is based on this passage from data.table's NEWS file, in the section on changes in Version 1.8.0:
o New function set(DT,i,j,value) allows fast assignment to elements of DT. Similar to := but avoids the overhead of [.data.table, so is much faster inside a loop. Less flexible than :=, but as flexible as matrix subassignment. Similar in spirit to setnames(), setcolorder(), setkey() and setattr(); i.e., assigns by reference with no copy at all.
M = matrix(1,nrow=100000,ncol=100)
DF = as.data.frame(M)
DT = as.data.table(M)
system.time(for (i in 1:1000) DF[i,1L] <- i)     # 591.000s
system.time(for (i in 1:1000) DT[i,V1:=i])       # 1.158s
system.time(for (i in 1:1000) M[i,1L] <- i)      # 0.016s
system.time(for (i in 1:1000) set(DT,i,1L,i))    # 0.027s
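To check the claim on your own machine, a scaled-down version of that comparison might look like the sketch below. The table size and the names DT1, DT2, t_assign and t_set are assumptions made for illustration; timings will vary by hardware and data.table version, so none are shown.

```r
library(data.table)

n <- 1000L
DT1 <- data.table(V1 = numeric(n))
DT2 <- copy(DT1)   # independent copy, so the two loops do not share storage

# Row-by-row assignment with := goes through [.data.table on every iteration
t_assign <- system.time(for (i in seq_len(n)) DT1[i, V1 := as.numeric(i)])

# The same assignment with set() skips that overhead
t_set <- system.time(for (i in seq_len(n)) set(DT2, i, 1L, as.numeric(i)))

print(t_assign)
print(t_set)
```

Both loops should leave the two tables with identical contents; only the elapsed times are expected to differ.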