I work with sports data in R frequently and keep running into the same problem with dplyr::group_by() when computing summary stats. I have the following data frame of predicted points for each match of the World Cup group stage:
dput(worldcup.df)
structure(list(teamA_name = c("Russia", "Egypt", "Morocco", "Portugal",
"France", "Argentina", "Peru", "Croatia", "Costa Rica", "Germany",
"Brazil", "Sweden", "Belgium", "Tunisia", "Colombia", "Poland",
"Russia", "Portugal", "Uruguay", "Iran", "Denmark", "France",
"Argentina", "Brazil", "Nigeria", "Serbia", "Belgium", "Korea Republic",
"Germany", "England", "Japan", "Poland", "Uruguay", "Saudi Arabia",
"Iran", "Spain", "Denmark", "Australia", "Nigeria", "Iceland",
"Mexico", "Korea Republic", "Serbia", "Switzerland", "Japan",
"Senegal", "Panama", "England"), teamB_name = c("Saudi Arabia",
"Uruguay", "Iran", "Spain", "Australia", "Iceland", "Denmark",
"Nigeria", "Serbia", "Mexico", "Switzerland", "Korea Republic",
"Panama", "England", "Japan", "Senegal", "Egypt", "Morocco",
"Saudi Arabia", "Spain", "Australia", "Peru", "Croatia", "Costa Rica",
"Iceland", "Switzerland", "Tunisia", "Mexico", "Sweden", "Panama",
"Senegal", "Colombia", "Russia", "Egypt", "Portugal", "Morocco",
"France", "Peru", "Argentina", "Croatia", "Sweden", "Germany",
"Brazil", "Costa Rica", "Poland", "Colombia", "Tunisia", "Belgium"
), epA = c(1.64, 0.7051, 1.1294, 1.1116, 2.1962, 1.984, 1.5765,
1.865, 1.2845, 2.0889, 2.1384, 1.5034, 2.1706, 0.5859, 2.1741,
1.6272, 1.4941, 2.1482, 2.2089, 0.635, 1.7694, 1.6016, 1.7816,
2.4745, 1.0762, 1.0326, 2.198, 1.0414, 2.2583, 2.198, 1.1264,
1.0471, 1.9565, 1.2201, 0.8364, 2.3633, 0.9337, 0.7922, 0.5665,
1.1593, 1.5544, 0.4698, 0.4331, 1.7843, 0.8872, 0.8157, 1.3932,
1.3932), epB = c(1.094, 2.0809, 1.6016, 1.6204, 0.6098, 0.787,
1.1535, 0.89, 1.4405, 0.6981, 0.6576, 1.2226, 0.6304, 2.2251,
0.6279, 1.1058, 1.2319, 0.6488, 0.5991, 2.165, 0.9756, 1.1294,
0.9644, 0.3895, 1.6588, 1.7064, 0.608, 1.6966, 0.5597, 0.608,
1.6046, 1.6909, 0.8105, 1.5069, 1.9266, 0.4757, 1.8163, 1.9778,
2.2495, 1.5697, 1.1746, 2.3712, 2.4179, 0.9617, 1.8688, 1.9503,
1.3308, 1.3308)), .Names = c("teamA_name", "teamB_name", "epA",
"epB"), class = "data.frame", row.names = c(NA, -48L))
head(worldcup.df)
teamA_name teamB_name epA epB
1 Russia Saudi Arabia 1.6400 1.0940
2 Egypt Uruguay 0.7051 2.0809
3 Morocco Iran 1.1294 1.6016
4 Portugal Spain 1.1116 1.6204
5 France Australia 2.1962 0.6098
6 Argentina Iceland 1.9840 0.7870
I've calculated epA and epB as the expected points for teams A and B in each game, and I'd now like to use group_by() to compute each of the 32 teams' total expected points. What I have historically done is something along these lines:
library(dplyr)

# total expected points when each team is listed as team A...
asAgroupby = worldcup.df %>%
  dplyr::group_by(teamA_name) %>%
  dplyr::summarise(totPts = sum(epA))
# ...and when listed as team B
asBgroupby = worldcup.df %>%
  dplyr::group_by(teamB_name) %>%
  dplyr::summarise(totPts = sum(epB))
# join the two halves, add them up, drop the intermediate columns
outputdf = asAgroupby %>%
  dplyr::left_join(asBgroupby, by = c('teamA_name' = 'teamB_name')) %>%
  dplyr::mutate(totPts = totPts.x + totPts.y) %>%
  dplyr::select(-one_of(c('totPts.x', 'totPts.y')))
Two separate group_by() calls, one for each of the teamA and teamB columns, followed by a left_join, summing the two columns, and dropping the excess columns... yuck. And this is as simple as the problem gets: exactly 4 columns (2 identifying columns and 2 stat columns). Since so much sports data has home/away team columns, this is a common problem.
I feel like I need one data frame with twice the rows and half the columns so I can do a single group_by(). Any help at all is appreciated, thanks!!!
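To make that concrete, what I'm picturing is one row per team per game, roughly like this (the column names team and ep are just placeholders):
team          ep
Russia        1.6400
Saudi Arabia  1.0940
Egypt         0.7051
Uruguay       2.0809
...
i.e. 96 rows and 2 columns instead of 48 rows and 4 columns, so a single group_by(team) would do the job.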
Edit: worldcup.df is built from a long %>% chain of dplyr functions - bonus points if this can be done without creating new variables, but rather just:
worldcup.df %>%
...
Here's a tidyverse workflow that works by reshaping the data into a long format. It also keeps track of who is in the same game (game_id), and whether they were the A or B team, if that is useful. (In all fairness, this is the same basic idea as @Emil's, just a different workflow to achieve it.)
library(tidyverse)  # dplyr, tidyr, tibble
library(stringr)    # str_extract()

worldcup.long <- worldcup.df %>%
  as_tibble() %>%
  # tag each match so the two team rows stay linked
  mutate(game_id = 1:n()) %>%
  # long format: one row per (game, original column)
  gather(key, value, -game_id) %>%
  mutate(
    AB  = str_extract(key, "A|B"),      # was this the A or B team?
    key = str_extract(key, "team|ep")   # which stat is it?
  ) %>%
  # back to one row per team per game, with team and ep columns
  spread(key, value, convert = TRUE)

outputdf <- worldcup.long %>%
  group_by(team) %>%
  summarize(totPts = sum(ep))
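For what it's worth, if you're on tidyr >= 1.0.0 the same reshape can be done in one step with pivot_longer(); this is just a sketch of the same idea:
library(dplyr)
library(tidyr)

worldcup.df %>%
  mutate(game_id = 1:n()) %>%
  # names_pattern splits e.g. "teamA_name" into the stat ("team") and the
  # slot ("A"); ".value" sends the values into new team and ep columns
  pivot_longer(-game_id,
               names_pattern = "(team|ep)([AB])",
               names_to = c(".value", "AB")) %>%
  group_by(team) %>%
  summarize(totPts = sum(ep))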
Here is a solution that takes fewer lines and doesn't require a join:
library(dplyr)
# swap the A/B columns, keep the original names, then stack and group
df2 <- worldcup.df[, c(2, 1, 4, 3)]
names(df2) <- names(worldcup.df)
rbind(worldcup.df, df2) %>% group_by(teamA_name) %>% summarise(sum(epA))
# A tibble: 32 x 2
teamA_name `sum(epA)`
<chr> <dbl>
1 Argentina 6.02
2 Australia 2.38
3 Belgium 5.70
4 Brazil 7.03
5 Colombia 5.82
6 Costa Rica 2.64
7 Croatia 4.40
8 Denmark 3.86
9 Egypt 3.44
10 England 5.82
Which is the same as the OP's:
outputdf
# A tibble: 32 x 2
   teamA_name totPts
<chr> <dbl>
1 Argentina 6.02
2 Australia 2.38
3 Belgium 5.70
4 Brazil 7.03
5 Colombia 5.82
6 Costa Rica 2.64
7 Croatia 4.40
8 Denmark 3.86
9 Egypt 3.44
10 England 5.82
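If you'd rather not create df2 at all, the same swap-and-stack idea can be collapsed into a single expression (a sketch; setNames() just restores the original column names after reordering):
library(dplyr)

rbind(worldcup.df,
      setNames(worldcup.df[, c(2, 1, 4, 3)], names(worldcup.df))) %>%
  group_by(teamA_name) %>%
  summarise(totPts = sum(epA))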
I've run into this issue as well with some fantasy football stuff. This is how I generally handle it:
# stack the A-team and B-team rows under common names, then one group_by
worldcup.df %>%
  select(team = teamA_name, ep = epA) %>%
  bind_rows(worldcup.df %>% select(team = teamB_name, ep = epB)) %>%
  group_by(team) %>%
  summarize(ep = sum(ep))
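And since the question asks for something that can hang off the end of an existing pipe without naming new variables, the two select() calls can be wrapped in a braced pipe stage, where . is the incoming data frame (a sketch; this relies on the magrittr %>%, not the native |> pipe):
worldcup.df %>%
  {bind_rows(
    select(., team = teamA_name, ep = epA),
    select(., team = teamB_name, ep = epB)
  )} %>%
  group_by(team) %>%
  summarize(ep = sum(ep))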