Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In R, group by sports data that has Away and Home teams - a common frustration

Tags:

r

dplyr

I work with sports data in R frequently, and run into the same problem with dplyr::group_by() when attempting to compute summary stats. I have the following dataframe with predicted points in each match of the world cup group stage:

dput(worldcup.df)
structure(list(teamA_name = c("Russia", "Egypt", "Morocco", "Portugal", 
"France", "Argentina", "Peru", "Croatia", "Costa Rica", "Germany", 
"Brazil", "Sweden", "Belgium", "Tunisia", "Colombia", "Poland", 
"Russia", "Portugal", "Uruguay", "Iran", "Denmark", "France", 
"Argentina", "Brazil", "Nigeria", "Serbia", "Belgium", "Korea Republic", 
"Germany", "England", "Japan", "Poland", "Uruguay", "Saudi Arabia", 
"Iran", "Spain", "Denmark", "Australia", "Nigeria", "Iceland", 
"Mexico", "Korea Republic", "Serbia", "Switzerland", "Japan", 
"Senegal", "Panama", "England"), teamB_name = c("Saudi Arabia", 
"Uruguay", "Iran", "Spain", "Australia", "Iceland", "Denmark", 
"Nigeria", "Serbia", "Mexico", "Switzerland", "Korea Republic", 
"Panama", "England", "Japan", "Senegal", "Egypt", "Morocco", 
"Saudi Arabia", "Spain", "Australia", "Peru", "Croatia", "Costa Rica", 
"Iceland", "Switzerland", "Tunisia", "Mexico", "Sweden", "Panama", 
"Senegal", "Colombia", "Russia", "Egypt", "Portugal", "Morocco", 
"France", "Peru", "Argentina", "Croatia", "Sweden", "Germany", 
"Brazil", "Costa Rica", "Poland", "Colombia", "Tunisia", "Belgium"
), epA = c(1.64, 0.7051, 1.1294, 1.1116, 2.1962, 1.984, 1.5765, 
1.865, 1.2845, 2.0889, 2.1384, 1.5034, 2.1706, 0.5859, 2.1741, 
1.6272, 1.4941, 2.1482, 2.2089, 0.635, 1.7694, 1.6016, 1.7816, 
2.4745, 1.0762, 1.0326, 2.198, 1.0414, 2.2583, 2.198, 1.1264, 
1.0471, 1.9565, 1.2201, 0.8364, 2.3633, 0.9337, 0.7922, 0.5665, 
1.1593, 1.5544, 0.4698, 0.4331, 1.7843, 0.8872, 0.8157, 1.3932, 
1.3932), epB = c(1.094, 2.0809, 1.6016, 1.6204, 0.6098, 0.787, 
1.1535, 0.89, 1.4405, 0.6981, 0.6576, 1.2226, 0.6304, 2.2251, 
0.6279, 1.1058, 1.2319, 0.6488, 0.5991, 2.165, 0.9756, 1.1294, 
0.9644, 0.3895, 1.6588, 1.7064, 0.608, 1.6966, 0.5597, 0.608, 
1.6046, 1.6909, 0.8105, 1.5069, 1.9266, 0.4757, 1.8163, 1.9778, 
2.2495, 1.5697, 1.1746, 2.3712, 2.4179, 0.9617, 1.8688, 1.9503, 
1.3308, 1.3308)), .Names = c("teamA_name", "teamB_name", "epA", 
"epB"), class = "data.frame", row.names = c(NA, -48L))

head(worldcup.df)
  teamA_name   teamB_name    epA    epB
1     Russia Saudi Arabia 1.6400 1.0940
2      Egypt      Uruguay 0.7051 2.0809
3    Morocco         Iran 1.1294 1.6016
4   Portugal        Spain 1.1116 1.6204
5     France    Australia 2.1962 0.6098
6  Argentina      Iceland 1.9840 0.7870

I've calculated epA and epB as the expected points for teams A and B in each game, and I'd now like to do a group_by() to compute each of the 32 teams total expected points. What I have historically done is something along these lines:

asAgroupby = worldcup.df %>% 
  dplyr::group_by(teamA_name) %>%
  dplyr::summarise(totPts = sum(epA))

asBgroupby = worldcup.df %>% 
  dplyr::group_by(teamB_name) %>%
  dplyr::summarise(totPts = sum(epB))

outputdf = asAgroupby %>%
  dplyr::left_join(asBgroupby, by = c('teamA_name'='teamB_name')) %>%
  dplyr::mutate(totPts = totPts.x + totPts.y) %>%
  dplyr::select(-one_of(c('totPts.x', 'totPts.y')))

2 separate group_by() calls, for each of the teamA and teamB column, followed by a left_join and then summing the columns and removing the excess column... yuck. This is as simple a case as this problem comes in as well: exactly 4 columns (2 identifying columns, and 2 stat columns). Since tons of sports data has columns for home / away teams, this is a common problem.

I feel like I need 1 dataframe with 2x the number of rows and 1/2 the number of columns so I can do one group by. Any help at all is appreciated, thanks!!!

Edit: worldcup.df is built from a long %>% of dplyr functions - bonus points if this can be done without creating new variables, but rather just:

worldcup.df %>%
  ... 
like image 649
Canovice Avatar asked Jun 05 '18 19:06

Canovice


3 Answers

Here's a tidyverse workflow that works by reformatting the data to a long format. It does keep track of who is in the same game (game_id), and whether they were the A or B team - if that is useful. (in all fairness, this is the same basic idea as @Emil, just a different workflow to achieve it.)

worldcup.long <- worldcup.df %>% 
  as_data_frame() %>%
  mutate(game_id = 1:n()) %>%
  gather(key, value, - game_id) %>%
  mutate(
    AB = str_extract(key, "A|B"),
    key = str_extract(key, "team|ep")
  ) %>%
  spread(key, value,convert = TRUE) 

outputdf <- worldcup.long %>%
  group_by(team) %>%
  summarize(totPts = sum(ep))
like image 169
Melissa Key Avatar answered Sep 23 '22 05:09

Melissa Key


Here is a solution that is fewer lines and doesn't require a join:

df2 <- df[,c(2,1,4,3)]
names(df2) <- names(df)
rbind(df, df2) %>% group_by(teamA_name) %>% summarise(sum(epA))

# A tibble: 32 x 2
teamA_name `sum(epA)`
<chr>           <dbl>
 1 Argentina        6.02
 2 Australia        2.38
 3 Belgium          5.70
 4 Brazil           7.03
 5 Colombia         5.82
 6 Costa Rica       2.64
 7 Croatia          4.40
 8 Denmark          3.86
 9 Egypt            3.44
10 England          5.82

Which is the same as the OP's:

outputdf
# A tibble: 32 x 2
teamA_name `sum(epA)`
<chr>           <dbl>
 1 Argentina        6.02
 2 Australia        2.38
 3 Belgium          5.70
 4 Brazil           7.03
 5 Colombia         5.82
 6 Costa Rica       2.64
 7 Croatia          4.40
 8 Denmark          3.86
 9 Egypt            3.44
10 England          5.82
like image 27
Emil Avatar answered Sep 19 '22 05:09

Emil


I've run into this issue as well with some fantasy football stuff. This is how I generally handle it:

df %>% select(team = teamA_name, ep = epA) %>% 
     bind_rows(df %>% select(team = teamB_name, ep = epB)) %>% 
     group_by(team) %>% 
     summarize(ep = sum(ep))
like image 24
TBT8 Avatar answered Sep 21 '22 05:09

TBT8