I'm trying to avoid a time-consuming for loop by using aggregate on a data.frame, but I need the values of one of the columns to enter into the final computation.
dat <- data.frame(key = c('a', 'b', 'a','b'),
rate = c(0.5,0.4,1,0.6),
v1 = c(4,0,3,1),
v2 = c(2,0,9,4))
>dat
key rate v1 v2
1 a 0.5 4 2
2 b 0.4 0 0
3 a 1.0 3 9
4 b 0.6 1 4
aggregate(dat[,-1], list(key=dat$key),
function(x, y=dat$rate){
rates <- as.numeric(y)
values <- as.numeric(x)
return(sum(values*rates)/sum(rates))
})
Note: The function is just an example!
The problem with this implementation is that y=dat$rate gives all 4 rates in dat, when what I want is just the 2 rates that belong to the group being aggregated!
Any suggestions on how I could do this?
Thanks!
The aggregate() function in R produces summary statistics, such as the mean, minimum, or sum, for one or more variables in a data frame, computed by group.
The process involves two stages. First, the individual cases of raw data are collated according to a grouping variable. Second, whichever calculation you want is performed on each group of cases.
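The two stages also explain the problem in the question: aggregate() hands FUN one column of one group at a time, so FUN never sees the matching rows of rate. A minimal sketch of one possible workaround, assuming it is acceptable to aggregate row indices rather than values (the name idx is made up for this illustration, and weighted.mean stands in for the example function):

```r
dat <- data.frame(key = c("a", "b", "a", "b"),
                  rate = c(0.5, 0.4, 1, 0.6),
                  v1 = c(4, 0, 3, 1),
                  v2 = c(2, 0, 9, 4))

# Stage 1: group the row numbers by key instead of grouping the values
idx <- seq_len(nrow(dat))

# Stage 2: inside FUN, i holds the row numbers of one group, so both the
# values and the matching rates can be looked up in dat
res <- aggregate(idx, list(key = dat$key),
                 function(i) weighted.mean(dat$v1[i], dat$rate[i]))
res
```

The result column is named x because a bare vector was aggregated. This only handles v1; covering several columns would need one call per column or one of the approaches in the answers.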
Here's what I managed to achieve, using the "data.table" package:
library(data.table)
DT <- data.table(dat, key = "key")
DT[, list(v1 = sum(rate * v1)/sum(rate), v2 = sum(rate * v2)/sum(rate)), by = "key"]
# key v1 v2
# 1: a 3.333333 6.666667
# 2: b 0.600000 2.400000
OK. So that's easy to write out for just two variables, but what about when we have a lot more columns? Use lapply(.SD, ...) in conjunction with your function:
First, some data:
set.seed(1)
dat <- data.frame(key = rep(c("a", "b"), times = 10),
rate = runif(20, min = 0, max = 1),
v1 = sample(10, 20, replace = TRUE),
v2 = sample(20, 20, replace = TRUE),
v3 = sample(30, 20, replace = TRUE),
x1 = sample(5, 20, replace = TRUE),
x2 = sample(6:10, 20, replace = TRUE),
x3 = sample(11:15, 20, replace = TRUE))
library(data.table)
datDT <- data.table(dat, key = "key")
datDT
# key rate v1 v2 v3 x1 x2 x3
# 1: a 0.26550866 10 17 28 3 9 15
# 2: a 0.57285336 7 16 14 2 7 13
# 3: a 0.20168193 3 11 20 4 9 14
# 4: a 0.94467527 1 1 15 4 6 13
# 5: a 0.62911404 9 15 3 2 10 12
# 6: a 0.20597457 5 10 11 2 10 13
# 7: a 0.68702285 5 9 11 4 7 11
# 8: a 0.76984142 9 2 15 4 6 15
# 9: a 0.71761851 8 7 26 3 9 13
# 10: a 0.38003518 8 14 24 5 8 15
# 11: b 0.37212390 3 13 9 4 7 13
# 12: b 0.90820779 2 12 10 2 10 11
# 13: b 0.89838968 4 16 8 2 7 13
# 14: b 0.66079779 4 10 23 1 8 12
# 15: b 0.06178627 4 14 27 1 8 13
# 16: b 0.17655675 6 18 26 1 9 11
# 17: b 0.38410372 2 5 11 5 8 14
# 18: b 0.49769924 7 2 27 4 6 13
# 19: b 0.99190609 2 11 12 3 6 13
# 20: b 0.77744522 5 9 29 4 9 13
Second, aggregate:
datDT[, lapply(.SD, function(x, y = rate) sum(y * x)/sum(y)), by = "key"]
# key rate v1 v2 v3 x1 x2 x3
# 1: a 0.6501303 6.335976 8.634691 15.75915 3.363832 7.658762 13.19152
# 2: b 0.7375793 3.595585 10.749705 16.26582 2.792390 7.741787 12.57301
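Note that the result above also contains a rate column, because rate is part of .SD and gets weighted by itself. A possible refinement, sketched here on the small four-row data from the question rather than the 20-row sample, is to restrict .SD with .SDcols; the name value_cols is made up for this illustration, and weighted.mean is used as an equivalent of sum(y * x)/sum(y):

```r
library(data.table)

dat <- data.frame(key = c("a", "b", "a", "b"),
                  rate = c(0.5, 0.4, 1, 0.6),
                  v1 = c(4, 0, 3, 1),
                  v2 = c(2, 0, 9, 4))
DT <- as.data.table(dat)

# Every column that is neither the grouping key nor the weight
value_cols <- setdiff(names(DT), c("key", "rate"))

# .SD now holds only value_cols; rate is still visible in j for each group
res <- DT[, lapply(.SD, weighted.mean, w = rate),
          by = key, .SDcols = value_cols]
res
```

This should give the same v1 and v2 values as the two-variable version, without the extra rate column.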
If you have a really large dataset, you might want to explore data.table in general.
For what it is worth, I was also successful in base R, but I'm not sure how efficient this would be, particularly because of the transposing and so on.
t(sapply(split(dat, dat[1]),
function(x, y = 3:ncol(dat)) {
V1 <- vector()
for (i in 1:length(y)) {
V1[i] <- sum(x[2] * x[y[i]])/sum(x[2])
}
V1
}))
# [,1] [,2] [,3] [,4] [,5] [,6]
# a 6.335976 8.634691 15.75915 3.363832 7.658762 13.19152
# b 3.595585 10.749705 16.26582 2.792390 7.741787 12.57301
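The same base R idea can probably be written without the inner loop or the transpose. The following is only a sketch on the small data from the question, assuming weighted.mean is an acceptable replacement for the hand-written sum(rate * x)/sum(rate); value_cols is a name invented for this example:

```r
dat <- data.frame(key = c("a", "b", "a", "b"),
                  rate = c(0.5, 0.4, 1, 0.6),
                  v1 = c(4, 0, 3, 1),
                  v2 = c(2, 0, 9, 4))

# Columns to summarise: everything except the key and the weight
value_cols <- setdiff(names(dat), c("key", "rate"))

# Split into one data frame per key, compute a named vector of weighted
# means for each piece, then stack the vectors row-wise into a matrix
res <- do.call(rbind, lapply(split(dat, dat$key), function(g) {
  sapply(g[value_cols], weighted.mean, w = g$rate)
}))
res
```

Because each per-group vector is named, the resulting matrix keeps the column names (v1, v2) as well as the group names as row names.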
One solution is to use ddply from the plyr package:
library(plyr)
res = ddply(dat, .(key), summarise, result = sum(v1 * rate) / sum(rate))
> res
key result
1 a 3.333333
2 b 0.600000
If you want to apply this to all the v columns, I would recommend first changing your data structure a bit:
library(reshape2)  # provides melt()
dat = melt(dat, id.vars = c("key", "rate"))
> dat
key rate variable value
1 a 0.5 v1 4
2 b 0.4 v1 0
3 a 1.0 v1 3
4 b 0.6 v1 1
5 a 0.5 v2 2
6 b 0.4 v2 0
7 a 1.0 v2 9
8 b 0.6 v2 4
and then using ddply again:
res = ddply(dat, .(key, variable), summarise, result = sum(value * rate) / sum(rate))
> res
key variable result
1 a v1 3.333333
2 a v2 6.666667
3 b v1 0.600000
4 b v2 2.400000
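If one row per key is preferred, the long result can be turned back into a wide table. This sketch uses base R's reshape() rather than anything from plyr, and rebuilds res by hand from the values printed above so that it runs on its own; the name wide is made up for this illustration:

```r
# The long-format result shown above, rebuilt by hand
res <- data.frame(key = c("a", "a", "b", "b"),
                  variable = c("v1", "v2", "v1", "v2"),
                  result = c(10/3, 20/3, 0.6, 2.4))

# One row per key, one result column per level of variable
wide <- reshape(res, idvar = "key", timevar = "variable", direction = "wide")
wide
```

reshape() names the new columns result.v1 and result.v2, so they may need renaming afterwards.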
...or if you need a standard R solution, you can use by (on the original, unmelted dat):
res = by(dat, list(dat$key), function(x) sum(x$v1 * x$rate) / sum(x$rate))
> res
: a
[1] 3.333333
------------------------------------------------------------
: b
[1] 0.6