I'm wanting to create a rolling function that conditionally counts the occurance of two columns in previous rows.
As an example, I have a dataset that looks like the following.
# Generate data
set.seed(123)
test <- data.frame(
Round = rep(1:5, times = 3),
Team = rep(c("Team 1", "Team 2", "Team 3"), each = 5),
Venue = sample(sample(c("Venue A", "Venue B"), 15, replace = T))
)
Round Team Venue
1 1 Team 1 Venue B
2 2 Team 1 Venue B
3 3 Team 1 Venue A
4 4 Team 1 Venue A
5 5 Team 1 Venue B
6 1 Team 2 Venue B
7 2 Team 2 Venue B
8 3 Team 2 Venue A
9 4 Team 2 Venue A
10 5 Team 2 Venue A
11 1 Team 3 Venue B
12 2 Team 3 Venue A
13 3 Team 3 Venue B
14 4 Team 3 Venue B
15 5 Team 3 Venue B
I want a new column that shows for each row, the number of times the team in that row has played at the venue in that row in the last 3 rounds.
I can do this quite easily with a for loop.
window <- 3
for (i in 1:nrow(dat)){
# Create index to search (if i is less than window, start at 1)
index <- max(i - window, 1):i
# Search when current row matches both team and venue
dat$VenueCount[i] <- sum(dat$Team[i] == dat$Team[index] & dat$Venue[i] == dat$Venue[index])
}
Round Team Venue VenueCount
1 1 Team 1 Venue B 1
2 2 Team 1 Venue B 2
3 3 Team 1 Venue A 1
4 4 Team 1 Venue A 2
5 5 Team 1 Venue B 2
6 1 Team 2 Venue B 1
7 2 Team 2 Venue B 2
8 3 Team 2 Venue A 1
9 4 Team 2 Venue A 2
10 5 Team 2 Venue A 3
11 1 Team 3 Venue B 1
12 2 Team 3 Venue A 1
13 3 Team 3 Venue B 2
14 4 Team 3 Venue B 3
15 5 Team 3 Venue B 3
However, I'm wanting to avoid a for loop (mostly as my actual dataset is relatively large at around ~30k rows). I'm thinking it should be doable with one of zoo
, dplyr
, purrr
or apply
but haven't been able to work it out.
Thanks
venturing a data.table
solution here. Will take it down if you are only looking for dplyr
solution
You can roll using a window of size 4, then count the number of occurrence matching the latest row.
library(data.table)
library(zoo)
setDT(test)
winsize <- 4
test[, .(Round,
Venue,
VenueCount=rollapplyr(c(rep("", winsize-1), Venue), winsize,
function(x) sum(x==last(x)))),
by=.(Team)]
result:
# Team Round Venue VenueCount
# 1: Team 1 1 Venue B 1
# 2: Team 1 2 Venue B 2
# 3: Team 1 3 Venue A 1
# 4: Team 1 4 Venue A 2
# 5: Team 1 5 Venue B 2
# 6: Team 2 1 Venue B 1
# 7: Team 2 2 Venue B 2
# 8: Team 2 3 Venue A 1
# 9: Team 2 4 Venue A 2
# 10: Team 2 5 Venue A 3
# 11: Team 3 1 Venue B 1
# 12: Team 3 2 Venue A 1
# 13: Team 3 3 Venue B 2
# 14: Team 3 4 Venue B 3
# 15: Team 3 5 Venue B 3
I actually worked out an answer using rollify
from the tibbletime
package with dplyr::mutate
. Will post here but still open to other responses!
library(dplyr)
library(tibbletime)
# Create data
set.seed(123)
test <- data.frame(
Round = rep(1:5, times = 3),
Team = rep(c("Team 1", "Team 2", "Team 3"), each = 5),
Venue = sample(sample(c("Venue A", "Venue B"), 15, replace = T))
)
Use rollify
to create custom function.
last_n_games = 3
count_games <- rollify(function(x) sum(last(x) == x), window = last_n_games)
Now use mutate to run the function. This returns NA for the first 2 rows (i.e. last_n_games - 1
). I can then use group_by
and row_number
to count these first occurrences
test <- test %>%
group_by(Team) %>%
mutate(VenueCount = count_games(Venue)) %>%
group_by(Team, Venue) %>%
mutate(VenueCount = ifelse(is.na(VenueCount), row_number(Team), VenueCount))
This returns the following
# A tibble: 15 x 4
# Groups: Team, Venue [6]
Round Team Venue VenueCount
<int> <fct> <fct> <int>
1 1 Team 1 Venue B 1
2 2 Team 1 Venue B 2
3 3 Team 1 Venue A 1
4 4 Team 1 Venue A 2
5 5 Team 1 Venue B 1
6 1 Team 2 Venue B 1
7 2 Team 2 Venue B 2
8 3 Team 2 Venue A 1
9 4 Team 2 Venue A 2
10 5 Team 2 Venue A 3
11 1 Team 3 Venue B 1
12 2 Team 3 Venue A 1
13 3 Team 3 Venue B 2
14 4 Team 3 Venue B 2
15 5 Team 3 Venue B 3
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With