I came across a problem on r-bloggers that I thought I'd try out as a fun project. However, I'm seeing that the loop used in the purrr::map() function is a bottleneck in my code so I was wondering if anyone has any ideas on how I can vectorize the calculation of scores in the below code and avoid the loop?
https://www.r-bloggers.com/2024/04/joint-fiddlin/
The gist of the problem is that Alice and Bob gets points based on pairs in a sequence of coin tosses. Alice gets a point if there are two sequential heads while Bob gets a point if heads is followed by tails. So in a sequence of (H, H, T, H, H), Alice gets 2 points while Bob gets 1 point.
I'm using these libs other than base library(dplyr) # bind_rows library(purrr) # map
I'm using this function to create one game
fn_is_heads <- function(count, sample, sample_size){
ret <- sample(sample, size = sample_size, replace = TRUE, prob = c(0.5, 0.5))
return(ret)
}
Then I'm using this to create a list of games (in this case 100,000 games of 100 tosses)
trials <- 1e5
flip_nbr <- 100
flips <- map(1:trials, .f = fn_is_heads, sample = c(TRUE, FALSE), sample_size = flip_nbr)
I wrote this function to calculate the scores per game
fn_score_vec <- function(flips){
# alice gets points -> TRUE, TRUE
# bob gets points -> TRUE, FALSE
alice_pts <- sum(flips[-length(flips)] == TRUE & flips[-1] == TRUE)
bob_pts <- sum(flips[-length(flips)] == TRUE & flips[-1] == FALSE)
return(data.frame(alice = alice_pts, bob = bob_pts))
}
..and I use purrr::map() to iterate over the list of games and return the scores per game
score <- bind_rows(map(.x = flips, .f = fn_score_vec))
Calculating the scores takes about 11s on my pc. Does anyone have any suggestions on how to get rid of purrr::map() and speed up the calculation of the scores (other than calculating in parallel)?
With a fair coin using base R:
trials <- 1e5
flip_nbr <- 100
system.time({
flips <- matrix(sample(!(0:1), trials*flip_nbr, 1), trials)
score <- data.frame(
alice = rowSums(flips[,-flip_nbr] & flips[,-1]),
bob = rowSums(flips[,-flip_nbr] & !flips[,-1])
)
})
#> user system elapsed
#> 0.67 0.08 0.75
Using runif and matrixStats::rowSums2.
flips <- matrix(runif(trials*flip_nbr) > .5, trials)
data.frame(
alice=matrixStats::rowSums2(flips[, -flip_nbr] & flips[, -1]),
bob=matrixStats::rowSums2(flips[, -flip_nbr] & !flips[, -1])
)
Using Rcpp.
Rcpp::cppFunction('
DataFrame calculateSums(int trials, int flip_nbr) {
NumericVector rand = Rcpp::runif(trials*flip_nbr);
LogicalMatrix flips(trials, flip_nbr);
for (int i = 0; i < trials; ++i) {
for (int j = 0; j < flip_nbr; ++j) {
flips(i, j) = (rand[i*flip_nbr + j] > 0.5);
}
}
IntegerVector alice(trials), bob(trials);
for (int i = 0; i < trials; ++i) {
for (int j = 0; j < flip_nbr - 1; ++j) {
if (flips(i, j) && flips(i, j + 1)) {
alice[i]++;
}
if (flips(i, j) && !flips(i, j + 1)) {
bob[i]++;
}
}
}
return DataFrame::create(_["alice"] = alice, _["bob"] = bob);
}
')
calculateSums(trials, flip_nbr)
$ Rscript --vanilla foo.R
Unit: milliseconds
expr min lq mean median uq max neval cld
sample 1234.5062 1253.5345 1276.6813 1275.4404 1291.8341 1343.2751 10 a
runif 819.1189 833.8159 871.1204 854.9161 876.7099 991.1695 10 b
rcpp 255.5373 266.1039 288.9363 284.2602 303.4640 338.6769 10 c
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With