Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Rolling conditional count in R

Tags:

r

dplyr

I'm wanting to create a rolling function that conditionally counts the occurance of two columns in previous rows.

As an example, I have a dataset that looks like the following.

# Generate data
set.seed(123)
test <- data.frame(
  Round = rep(1:5, times = 3),
  Team = rep(c("Team 1", "Team 2", "Team 3"), each = 5),
  Venue = sample(sample(c("Venue A", "Venue B"), 15, replace = T))
)

   Round   Team   Venue
1      1 Team 1 Venue B
2      2 Team 1 Venue B
3      3 Team 1 Venue A
4      4 Team 1 Venue A
5      5 Team 1 Venue B
6      1 Team 2 Venue B
7      2 Team 2 Venue B
8      3 Team 2 Venue A
9      4 Team 2 Venue A
10     5 Team 2 Venue A
11     1 Team 3 Venue B
12     2 Team 3 Venue A
13     3 Team 3 Venue B
14     4 Team 3 Venue B
15     5 Team 3 Venue B

I want a new column that shows for each row, the number of times the team in that row has played at the venue in that row in the last 3 rounds.

I can do this quite easily with a for loop.

window <- 3

for (i in 1:nrow(dat)){
  # Create index to search (if i is less than window, start at 1)
  index <- max(i - window, 1):i

  # Search when current row matches both team and venue
  dat$VenueCount[i] <- sum(dat$Team[i] == dat$Team[index] & dat$Venue[i] == dat$Venue[index])
}

   Round   Team   Venue VenueCount
1      1 Team 1 Venue B          1
2      2 Team 1 Venue B          2
3      3 Team 1 Venue A          1
4      4 Team 1 Venue A          2
5      5 Team 1 Venue B          2
6      1 Team 2 Venue B          1
7      2 Team 2 Venue B          2
8      3 Team 2 Venue A          1
9      4 Team 2 Venue A          2
10     5 Team 2 Venue A          3
11     1 Team 3 Venue B          1
12     2 Team 3 Venue A          1
13     3 Team 3 Venue B          2
14     4 Team 3 Venue B          3
15     5 Team 3 Venue B          3

However, I'm wanting to avoid a for loop (mostly as my actual dataset is relatively large at around ~30k rows). I'm thinking it should be doable with one of zoo, dplyr, purrr or apply but haven't been able to work it out.

Thanks

like image 829
jimmyday87 Avatar asked Apr 03 '18 02:04

jimmyday87


2 Answers

venturing a data.table solution here. Will take it down if you are only looking for dplyr solution

You can roll using a window of size 4, then count the number of occurrence matching the latest row.

library(data.table)
library(zoo)
setDT(test)
winsize <- 4
test[, .(Round, 
        Venue, 
        VenueCount=rollapplyr(c(rep("", winsize-1), Venue), winsize, 
            function(x) sum(x==last(x)))), 
    by=.(Team)]

result:

#       Team Round   Venue VenueCount
#  1: Team 1     1 Venue B          1
#  2: Team 1     2 Venue B          2
#  3: Team 1     3 Venue A          1
#  4: Team 1     4 Venue A          2
#  5: Team 1     5 Venue B          2
#  6: Team 2     1 Venue B          1
#  7: Team 2     2 Venue B          2
#  8: Team 2     3 Venue A          1
#  9: Team 2     4 Venue A          2
# 10: Team 2     5 Venue A          3
# 11: Team 3     1 Venue B          1
# 12: Team 3     2 Venue A          1
# 13: Team 3     3 Venue B          2
# 14: Team 3     4 Venue B          3
# 15: Team 3     5 Venue B          3
like image 166
chinsoon12 Avatar answered Sep 23 '22 04:09

chinsoon12


I actually worked out an answer using rollify from the tibbletime package with dplyr::mutate. Will post here but still open to other responses!

library(dplyr)
library(tibbletime)

# Create data
set.seed(123)
test <- data.frame(
  Round = rep(1:5, times = 3),
  Team = rep(c("Team 1", "Team 2", "Team 3"), each = 5),
  Venue = sample(sample(c("Venue A", "Venue B"), 15, replace = T))
)

Use rollify to create custom function.

last_n_games = 3
count_games <- rollify(function(x) sum(last(x) == x), window = last_n_games)

Now use mutate to run the function. This returns NA for the first 2 rows (i.e. last_n_games - 1). I can then use group_by and row_number to count these first occurrences

test <- test %>%
  group_by(Team) %>%
  mutate(VenueCount = count_games(Venue)) %>%
  group_by(Team, Venue) %>%
  mutate(VenueCount = ifelse(is.na(VenueCount), row_number(Team), VenueCount))

This returns the following

# A tibble: 15 x 4
# Groups:   Team, Venue [6]
   Round Team   Venue   VenueCount
   <int> <fct>  <fct>        <int>
 1     1 Team 1 Venue B          1
 2     2 Team 1 Venue B          2
 3     3 Team 1 Venue A          1
 4     4 Team 1 Venue A          2
 5     5 Team 1 Venue B          1
 6     1 Team 2 Venue B          1
 7     2 Team 2 Venue B          2
 8     3 Team 2 Venue A          1
 9     4 Team 2 Venue A          2
10     5 Team 2 Venue A          3
11     1 Team 3 Venue B          1
12     2 Team 3 Venue A          1
13     3 Team 3 Venue B          2
14     4 Team 3 Venue B          2
15     5 Team 3 Venue B          3
like image 20
jimmyday87 Avatar answered Sep 19 '22 04:09

jimmyday87