I work with sports data in R frequently and keep running into the same problem with dplyr::group_by() when computing summary stats. I have the following data frame of predicted points for each match of the World Cup group stage:
dput(worldcup.df)
structure(list(teamA_name = c("Russia", "Egypt", "Morocco", "Portugal",
"France", "Argentina", "Peru", "Croatia", "Costa Rica", "Germany",
"Brazil", "Sweden", "Belgium", "Tunisia", "Colombia", "Poland",
"Russia", "Portugal", "Uruguay", "Iran", "Denmark", "France",
"Argentina", "Brazil", "Nigeria", "Serbia", "Belgium", "Korea Republic",
"Germany", "England", "Japan", "Poland", "Uruguay", "Saudi Arabia",
"Iran", "Spain", "Denmark", "Australia", "Nigeria", "Iceland",
"Mexico", "Korea Republic", "Serbia", "Switzerland", "Japan",
"Senegal", "Panama", "England"), teamB_name = c("Saudi Arabia",
"Uruguay", "Iran", "Spain", "Australia", "Iceland", "Denmark",
"Nigeria", "Serbia", "Mexico", "Switzerland", "Korea Republic",
"Panama", "England", "Japan", "Senegal", "Egypt", "Morocco",
"Saudi Arabia", "Spain", "Australia", "Peru", "Croatia", "Costa Rica",
"Iceland", "Switzerland", "Tunisia", "Mexico", "Sweden", "Panama",
"Senegal", "Colombia", "Russia", "Egypt", "Portugal", "Morocco",
"France", "Peru", "Argentina", "Croatia", "Sweden", "Germany",
"Brazil", "Costa Rica", "Poland", "Colombia", "Tunisia", "Belgium"
), epA = c(1.64, 0.7051, 1.1294, 1.1116, 2.1962, 1.984, 1.5765,
1.865, 1.2845, 2.0889, 2.1384, 1.5034, 2.1706, 0.5859, 2.1741,
1.6272, 1.4941, 2.1482, 2.2089, 0.635, 1.7694, 1.6016, 1.7816,
2.4745, 1.0762, 1.0326, 2.198, 1.0414, 2.2583, 2.198, 1.1264,
1.0471, 1.9565, 1.2201, 0.8364, 2.3633, 0.9337, 0.7922, 0.5665,
1.1593, 1.5544, 0.4698, 0.4331, 1.7843, 0.8872, 0.8157, 1.3932,
1.3932), epB = c(1.094, 2.0809, 1.6016, 1.6204, 0.6098, 0.787,
1.1535, 0.89, 1.4405, 0.6981, 0.6576, 1.2226, 0.6304, 2.2251,
0.6279, 1.1058, 1.2319, 0.6488, 0.5991, 2.165, 0.9756, 1.1294,
0.9644, 0.3895, 1.6588, 1.7064, 0.608, 1.6966, 0.5597, 0.608,
1.6046, 1.6909, 0.8105, 1.5069, 1.9266, 0.4757, 1.8163, 1.9778,
2.2495, 1.5697, 1.1746, 2.3712, 2.4179, 0.9617, 1.8688, 1.9503,
1.3308, 1.3308)), .Names = c("teamA_name", "teamB_name", "epA",
"epB"), class = "data.frame", row.names = c(NA, -48L))
head(worldcup.df)
teamA_name teamB_name epA epB
1 Russia Saudi Arabia 1.6400 1.0940
2 Egypt Uruguay 0.7051 2.0809
3 Morocco Iran 1.1294 1.6016
4 Portugal Spain 1.1116 1.6204
5 France Australia 2.1962 0.6098
6 Argentina Iceland 1.9840 0.7870
I've calculated epA and epB as the expected points for teams A and B in each game, and I'd now like to use group_by() to compute each of the 32 teams' total expected points. What I have historically done is something along these lines:
library(dplyr)

# total expected points when each team is listed as team A...
asAgroupby = worldcup.df %>%
  dplyr::group_by(teamA_name) %>%
  dplyr::summarise(totPts = sum(epA))
# ...and when listed as team B
asBgroupby = worldcup.df %>%
  dplyr::group_by(teamB_name) %>%
  dplyr::summarise(totPts = sum(epB))
# join the two halves, add them up, drop the intermediate columns
outputdf = asAgroupby %>%
  dplyr::left_join(asBgroupby, by = c('teamA_name' = 'teamB_name')) %>%
  dplyr::mutate(totPts = totPts.x + totPts.y) %>%
  dplyr::select(-one_of(c('totPts.x', 'totPts.y')))
Two separate group_by() calls, one for each of the teamA and teamB columns, followed by a left_join, summing the two columns, and dropping the excess columns... yuck. And this is as simple as the problem gets: exactly 4 columns (2 identifying columns and 2 stat columns). Since so much sports data has home/away team columns, this is a common problem.
I feel like I need one data frame with twice the rows and half the columns so I can do a single group_by(). Any help at all is appreciated, thanks!!!
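To make that concrete, what I'm picturing is one row per team per game, roughly like this (the column names team and ep are just placeholders):
team          ep
Russia        1.6400
Saudi Arabia  1.0940
Egypt         0.7051
Uruguay       2.0809
...
i.e. 96 rows and 2 columns instead of 48 rows and 4 columns, so a single group_by(team) would do the job.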
Edit: worldcup.df is built from a long %>% chain of dplyr functions - bonus points if this can be done without creating new variables, but rather just:
worldcup.df %>%
...
Here's a tidyverse workflow that works by reshaping the data into a long format. It also keeps track of who is in the same game (game_id), and whether they were the A or B team, if that is useful. (In all fairness, this is the same basic idea as @Emil's, just a different workflow to achieve it.)
library(tidyverse)  # dplyr, tidyr, tibble
library(stringr)    # str_extract()

worldcup.long <- worldcup.df %>%
  as_tibble() %>%
  # tag each match so the two team rows stay linked
  mutate(game_id = 1:n()) %>%
  # long format: one row per (game, original column)
  gather(key, value, -game_id) %>%
  mutate(
    AB  = str_extract(key, "A|B"),      # was this the A or B team?
    key = str_extract(key, "team|ep")   # which stat is it?
  ) %>%
  # back to one row per team per game, with team and ep columns
  spread(key, value, convert = TRUE)

outputdf <- worldcup.long %>%
  group_by(team) %>%
  summarize(totPts = sum(ep))
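For what it's worth, if you're on tidyr >= 1.0.0 the same reshape can be done in one step with pivot_longer(); this is just a sketch of the same idea:
library(dplyr)
library(tidyr)

worldcup.df %>%
  mutate(game_id = 1:n()) %>%
  # names_pattern splits e.g. "teamA_name" into the stat ("team") and the
  # slot ("A"); ".value" sends the values into new team and ep columns
  pivot_longer(-game_id,
               names_pattern = "(team|ep)([AB])",
               names_to = c(".value", "AB")) %>%
  group_by(team) %>%
  summarize(totPts = sum(ep))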
Here is a solution that takes fewer lines and doesn't require a join:
library(dplyr)
# swap the A/B columns, keep the original names, then stack and group
df2 <- worldcup.df[, c(2, 1, 4, 3)]
names(df2) <- names(worldcup.df)
rbind(worldcup.df, df2) %>% group_by(teamA_name) %>% summarise(sum(epA))
# A tibble: 32 x 2
teamA_name `sum(epA)`
<chr> <dbl>
1 Argentina 6.02
2 Australia 2.38
3 Belgium 5.70
4 Brazil 7.03
5 Colombia 5.82
6 Costa Rica 2.64
7 Croatia 4.40
8 Denmark 3.86
9 Egypt 3.44
10 England 5.82
Which is the same as the OP's:
outputdf
# A tibble: 32 x 2
   teamA_name totPts
<chr> <dbl>
1 Argentina 6.02
2 Australia 2.38
3 Belgium 5.70
4 Brazil 7.03
5 Colombia 5.82
6 Costa Rica 2.64
7 Croatia 4.40
8 Denmark 3.86
9 Egypt 3.44
10 England 5.82
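If you'd rather not create df2 at all, the same swap-and-stack idea can be collapsed into a single expression (a sketch; setNames() just restores the original column names after reordering):
library(dplyr)

rbind(worldcup.df,
      setNames(worldcup.df[, c(2, 1, 4, 3)], names(worldcup.df))) %>%
  group_by(teamA_name) %>%
  summarise(totPts = sum(epA))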
I've run into this issue as well with some fantasy football stuff. This is how I generally handle it:
# stack the A-team and B-team rows under common names, then one group_by
worldcup.df %>%
  select(team = teamA_name, ep = epA) %>%
  bind_rows(worldcup.df %>% select(team = teamB_name, ep = epB)) %>%
  group_by(team) %>%
  summarize(ep = sum(ep))
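And since the question asks for something that can hang off the end of an existing pipe without naming new variables, the two select() calls can be wrapped in a braced pipe stage, where . is the incoming data frame (a sketch; this relies on the magrittr %>%, not the native |> pipe):
worldcup.df %>%
  {bind_rows(
    select(., team = teamA_name, ep = epA),
    select(., team = teamB_name, ep = epB)
  )} %>%
  group_by(team) %>%
  summarize(ep = sum(ep))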