I want to "loop through" the rows of a data.table and calculate an average for each row. The average should be calculated based on the following mechanism:
Data1
values in all rows j
, which meet these two criteria: ID(j) = ID(i)
and T1(j) = T2(i)
Enter the calculated average in the column Data2 of row i
library(data.table)
DF = data.frame(ID=rep(c("a","b"),each=6),
                T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
DT = data.table(DF)
DT[ , Data2:=NA_real_]
ID T1 T2 Data1 Data2
[1,] a 1 1 1 NA
[2,] a 1 2 2 NA
[3,] a 1 3 3 NA
[4,] a 2 1 4 NA
[5,] a 2 2 5 NA
[6,] a 2 3 6 NA
[7,] b 1 1 7 NA
[8,] b 1 2 8 NA
[9,] b 1 3 9 NA
[10,] b 2 1 10 NA
[11,] b 2 2 11 NA
[12,] b 2 3 12 NA
For this simple example the result should look like this:
ID T1 T2 Data1 Data2
[1,] a 1 1 1 2
[2,] a 1 2 2 5
[3,] a 1 3 3 NA
[4,] a 2 1 4 2
[5,] a 2 2 5 5
[6,] a 2 3 6 NA
[7,] b 1 1 7 8
[8,] b 1 2 8 11
[9,] b 1 3 9 NA
[10,] b 2 1 10 8
[11,] b 2 2 11 11
[12,] b 2 3 12 NA
I think one way of doing this would be to loop through the rows, but I think that is inefficient. I've had a look at the apply() function, but I'm not sure whether it would solve my problem. I could also use data.frame instead of data.table if this would make it much more efficient or much easier. The real dataset contains approximately 1 million rows.
The rule of thumb is to aggregate first, and then join to that.
# Aggregate first: mean of Data1 within each (ID, T1) group
agg = DT[, mean(Data1), by=list(ID,T1)]
setkey(agg, ID, T1)
# Then join: look up each row's (ID, T2) in the keyed aggregate
DT[, Data2:={JT=J(ID,T2); agg[JT,V1][[3]]}]
ID T1 T2 Data1 Data2
[1,] a 1 1 1 2
[2,] a 1 2 2 5
[3,] a 1 3 3 NA
[4,] a 2 1 4 2
[5,] a 2 2 5 5
[6,] a 2 3 6 NA
[7,] b 1 1 7 8
[8,] b 1 2 8 11
[9,] b 1 3 9 NA
[10,] b 2 1 10 8
[11,] b 2 2 11 11
[12,] b 2 3 12 NA
As you can see it's a bit ugly in this case (but it will be fast). It's planned to add a drop argument, which will avoid the [[3]] bit, and maybe we could provide a way to tell [.data.table to evaluate i in the calling scope (i.e. no self join), which would avoid the JT= bit; that part is needed here because ID is a column of both agg and DT. keyby has been added in v1.8.0 on R-Forge, so that avoids the need for the setkey, too.
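(For reference, on current data.table versions the same aggregate-then-join pattern can be written as a single update join with the on= syntax; the following is a sketch under that assumption, not part of the original answer.)

agg = DT[, .(avg = mean(Data1)), by = .(ID, T1)]
# Update join: match each row's (ID, T2) to agg's (ID, T1);
# rows with no matching group keep Data2 = NA
DT[agg, Data2 := i.avg, on = .(ID, T2 = T1)]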
A somewhat faster alternative to explicitly iterating over the rows is a solution that keeps the per-row lookup vectorized and lets sapply handle the iteration.
R> d <- data.frame(ID=rep(c("a","b"),each=6), T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
R> d
ID T1 T2 Data1
1 a 1 1 1
2 a 1 2 2
3 a 1 3 3
4 a 2 1 4
5 a 2 2 5
6 a 2 3 6
7 b 1 1 7
8 b 1 2 8
9 b 1 3 9
10 b 2 1 10
11 b 2 2 11
12 b 2 3 12
R> rowfunction <- function(i) with(d, mean(Data1[which(T1==T2[i] & ID==ID[i])]))
R> d$Data2 <- sapply(1:nrow(d), rowfunction)
R> d
ID T1 T2 Data1 Data2
1 a 1 1 1 2
2 a 1 2 2 5
3 a 1 3 3 NaN
4 a 2 1 4 2
5 a 2 2 5 5
6 a 2 3 6 NaN
7 b 1 1 7 8
8 b 1 2 8 11
9 b 1 3 9 NaN
10 b 2 1 10 8
11 b 2 2 11 11
12 b 2 3 12 NaN
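Note that mean() of an empty vector returns NaN rather than NA, which is why the unmatched rows show NaN above. If NA is required to match the desired output, a one-line cleanup (a sketch) converts them:

d$Data2[is.nan(d$Data2)] <- NA   # NaN arose from mean() over zero matching rows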
Also, I'd prefer to preprocess the data before getting it into R. That is, if you are retrieving the data from an SQL server, it is probably a better choice to let the server calculate the averages, as it will very likely do a better job of this. R is actually not very good at raw number crunching, for several reasons, but it is excellent for doing statistics on already-preprocessed data.
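For illustration only, pushing the aggregation to the database from R might look like the sketch below; the connection, the table name mytable, and the backend are all assumptions, not part of the original answer.

library(DBI)
# Hypothetical SQLite connection; any DBI backend works the same way
con <- dbConnect(RSQLite::SQLite(), "data.db")
# The GROUP BY runs server-side; R only receives the per-group means
agg <- dbGetQuery(con,
  "SELECT ID, T1, AVG(Data1) AS avg FROM mytable GROUP BY ID, T1")
dbDisconnect(con)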
Using tapply and part of another recent post:
DF = data.frame(ID=rep(c("a","b"),each=6), T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
EDIT: Actually, most of the original function is redundant and was intended for something else. Here, simplified:
# Average Data1 within each (ID, T1) group: a matrix indexed ID x T1
ansMat <- tapply(DF$Data1, DF[, c("ID", "T1")], mean)
# Look up each row's (ID, T2) cell; T2 values with no T1 group give NA
i <- cbind(match(DF$ID, rownames(ansMat)), match(DF$T2, colnames(ansMat)))
DF <- cbind(DF, Data2 = ansMat[i])
# ansMat<-tapply(seq_len(nrow(DF)), DF[, c("ID", "T1")], function(x) {
# curSub <- DF[x, ]
# myIndex <- which(DF$T2 == curSub$T1 & DF$ID == curSub$ID)
# meanData1 <- mean(curSub$Data1)
# return(meanData1 = meanData1)
# })
The trick was doing tapply over ID and T1 instead of ID and T2. Anything speedier?
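One candidate in the same spirit collapses the two-dimensional matrix lookup into a single named-vector lookup on a pasted ID/T1 key; this is a sketch, and whether it is actually faster would need benchmarking:

# Group means keyed by a pasted "ID T1" string
avgs <- tapply(DF$Data1, paste(DF$ID, DF$T1), mean)
# Look up each row's "ID T2" key; keys with no group yield NA
DF$Data2 <- unname(avgs[paste(DF$ID, DF$T2)])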