I have a dataset of individual decisions made in groups. For each individual, I need an aggregate (say, a sum) of the decisions of all other members of their group. So let's say the data looks like:
set.seed(123)
group_id <- c(sapply(seq(1, 3), rep, times = 3))
person_id <- rep(seq(1, 3), 3)
decision <- sample(1:10, 9, replace = TRUE)
df <- data.frame(group_id, person_id, decision)
df
The result is:
  group_id person_id decision
1        1         1        3
2        1         2        8
3        1         3        5
4        2         1        9
5        2         2       10
6        2         3        1
7        3         1        6
8        3         2        9
9        3         3        6
And I need to produce something like this:
  group_id person_id decision others_decision
1        1         1        3              13
2        1         2        8               8
3        1         3        5              11
So for each member of a group, I take all the other members of the same group and do something with their decisions (here, a sum). I can do this with a plain for loop, but it seems ugly and inefficient. Are there better solutions?
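For reference, a minimal for-loop baseline along the lines described above might look like this (just a sketch):
# for each row, sum the decisions of the other members of the same group
others_decision <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  same_group <- df$group_id == df$group_id[i] & df$person_id != df$person_id[i]
  others_decision[i] <- sum(df$decision[same_group])
}
df$others_decision <- others_decision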
UPDATE:
Here is the solution I have figured out so far; sorry for the ugliness:
library(dplyr)

df$other_decision <- unlist(by(df, 1:nrow(df), function(row) {
  df %>%
    filter(group_id == row$group_id, person_id != row$person_id) %>%
    summarize(sum(decision))
}))
df
You can do:
df %>%
  inner_join(df, by = c("group_id" = "group_id")) %>%
  filter(person_id.x != person_id.y) %>%
  group_by(group_id, person_id = person_id.x) %>%
  summarise(decision = first(decision.x),
            others_decision = sum(decision.y))
  group_id person_id decision others_decision
     <int>     <int>    <int>           <int>
1        1         1        3              13
2        1         2        8               8
3        1         3        5              11
4        2         1        9              11
5        2         2       10              10
6        2         3        1              19
7        3         1        6              15
8        3         2        9              12
9        3         3        6              15
Depending on the size of your actual dataset, this may become computationally rather demanding, as it involves an inner join.
Another possibility not involving an inner join could be:
df %>%
  group_by(group_id) %>%
  mutate(others_decision = list(decision),
         rowid = 1:n()) %>%
  ungroup() %>%
  rowwise() %>%
  mutate(others_decision = sum(unlist(others_decision)[-rowid])) %>%
  ungroup() %>%
  select(-rowid)
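If all you need is the sum (rather than an arbitrary aggregation), a still simpler variant that avoids both the join and the list column is to subtract each row's own value from its group total; a quick sketch of that idea:
df %>%
  group_by(group_id) %>%
  mutate(others_decision = sum(decision) - decision) %>%
  ungroup()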
This can be done fairly simply by writing a helper function that takes a summary function as an argument and applies it to the vector with each observation removed in turn.
library(dplyr)
my_summarise <- function(x, FUN, ...) {
  sapply(seq_along(x), function(y) FUN(x[-y], ...))
}
df %>%
  group_by(group_id) %>%
  mutate(dsum = my_summarise(decision, sum),
         dmean = my_summarise(decision, mean),
         dmax = my_summarise(decision, max))
# A tibble: 9 x 6
# Groups: group_id [3]
  group_id person_id decision  dsum dmean  dmax
     <int>     <int>    <int> <int> <dbl> <int>
1        1         1        3    13   6.5     8
2        1         2        8     8   4       5
3        1         3        5    11   5.5     8
4        2         1        9    11   5.5    10
5        2         2       10    10   5       9
6        2         3        1    19   9.5    10
7        3         1        6    15   7.5     9
8        3         2        9    12   6       6
9        3         3        6    15   7.5     9
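As a side note, the ... argument is simply forwarded to FUN, so extra arguments such as na.rm can be passed through; a small illustration with a made-up vector:
x <- c(3, NA, 5)
my_summarise(x, sum, na.rm = TRUE)
# [1] 5 8 3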
Here are a few data.table methods:
library(data.table)
dt <- as.data.table(df)
# don't update the original dt
dt[dt, on = .(group_id), allow.cartesian = TRUE
   ][person_id != i.person_id,
     .(decision = first(i.decision), others = sum(decision)),
     by = .(group_id, person_id = i.person_id)]

# update the original dt, way 1
dt[,
   others_decision := .SD[.SD, on = .(group_id), allow.cartesian = TRUE
                          ][person_id != i.person_id, sum(decision), by = .(group_id, i.person_id)]$V1
   ]

# update the original dt, way 2
dt[,
   others_decision := dt[group_id == .BY[[1]] & person_id != .BY[[2]], sum(decision)],
   by = .(group_id, person_id)]
The first two are more or less @tmfmnk's approach, but done with data.table. The last is the most intuitive to me, but it is likely the slowest.
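If you want to check that on your own data, a rough timing comparison (assuming the microbenchmark package is available; a sketch, not part of the answer above) could look like:
library(microbenchmark)
microbenchmark(
  join = dt[dt, on = .(group_id), allow.cartesian = TRUE
            ][person_id != i.person_id, sum(decision), by = .(group_id, i.person_id)],
  by_row = dt[, dt[group_id == .BY[[1]] & person_id != .BY[[2]], sum(decision)],
              by = .(group_id, person_id)],
  times = 100
)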