Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cumulative mean with conditionals

New to R. Small rep of my df:

PTS_TeamHome <- c(101,87,94,110,95)
PTS_TeamAway <- c(95,89,105,111,121)
TeamHome <- c("LAL", "HOU", "SAS", "MIA", "LAL")
TeamAway <- c("IND", "LAL", "LAL", "HOU", "NOP")
df <- data.frame(cbind(TeamHome, TeamAway,PTS_TeamHome,PTS_TeamAway))
df

TeamHome TeamAway PTS_TeamHome PTS_TeamAway
  LAL      IND          101           95
  HOU      LAL           87           89
  SAS      LAL           94          105
  MIA      HOU          110          111
  LAL      NOP           95          121

Imagine these are the first four games of a season with 1230 games. I want to calculate the cumulative points per game (mean) at any given time for the home team and the visiting team.

The output would look like this:

  TeamHome TeamAway PTS_TeamHome PTS_TeamAway HOMETEAM_AVGCUMPTS ROADTEAM_AVGCUMPTS
1  LAL      IND          101           95                101                 95
2  HOU      LAL           87           89                 87                 95
3  SAS      LAL           94          105                 94              98.33
4  MIA      HOU          110          111                110                 99
5  LAL      NOP           95          121               97.5                121

Note that what the formula does for the fifth game for the home team. Since the LAL is the home team it looks for how many points has LAL scored when playing at home or on the road. In this case (101 + 89 + 105 + 95) / 4 = 97.5

Here is what I tried without much success:

lst <- list()
for(i in 1:nrow(df)) lst[[i]] <- ( cumsum(df[which(df$TEAM1[1:i]==df$TEAM1[i]),df$PTS_TeamAway,0]) 
                                 + cumsum(df[which(df$TEAM2[1:i]==df$TEAM1[i]),df$PTS_TeamHome,0]) ) 
                             / #divided by number of games
  df$HOMETEAM_AVGCUMPTS <- unlist(lst)

I wanted to calculate the cumulative PTS and then the number of games to divide it by but none of this worked.

like image 616
Sburg13 Avatar asked Dec 20 '22 00:12

Sburg13


1 Answers

I would argue that you should restructure your data in a tidier format with two rows per game: one row for the visiting team and one row for the home team. It is much easier to work with data that is in a tidy/long format.

library(dplyr)
library(tidyr)

df %>%
  mutate(game = row_number()) %>%
  gather(location, team, TeamHome, TeamAway) %>%
  gather(location2, points, PTS_TeamHome, PTS_TeamAway) %>%
  filter(
    (location == "TeamHome" & location2 == "PTS_TeamHome") | 
      (location == "TeamAway" & location2 == "PTS_TeamAway")
  ) %>%
  select(-location2) %>%
  arrange(game) %>%
  group_by(team) %>%
  mutate(run_mean_points = cummean(points))

data

# note that cbind() is removed.

df <- data.frame(TeamHome, TeamAway,PTS_TeamHome,PTS_TeamAway, stringsAsFactors = FALSE)

Source: local data frame [10 x 5]
Groups: team

   game location team points run_mean_points
1     1 TeamHome  LAL    101       101.00000
2     1 TeamAway  IND     95        95.00000
3     2 TeamHome  HOU     87        87.00000
4     2 TeamAway  LAL     89        95.00000
5     3 TeamHome  SAS     94        94.00000
6     3 TeamAway  LAL    105        98.33333
7     4 TeamHome  MIA    110       110.00000
8     4 TeamAway  HOU    111        99.00000
9     5 TeamHome  LAL     95        97.50000
10    5 TeamAway  NOP    121       121.00000
like image 150
davechilders Avatar answered Jan 07 '23 19:01

davechilders