player_ids = c(34, 87, 27, 34, 87, 9, 29, 25, 24, 25, 34, 37)
end = length(player_ids)
unique_players_list = list()
for(i in 1:end) {
    unique_players_list[[i]] = unique(player_ids[1:i])
}
This is (a shortened version of) the for loop I am trying to vectorize. I'm not sure how to post code output, but the list unique_players_list should have the following contents:
unique_players_list[[1]] == c(34)
unique_players_list[[2]] == c(34, 87)
unique_players_list[[3]] == c(34, 87, 27)
unique_players_list[[4]] == c(34, 87, 27)
unique_players_list[[5]] == c(34, 87, 27)
"and so on. the output does not have to be in a list, and I would actually prefer a dataframe, however i need this vectorized because my current for loop takes forever and I need to run this code tens of thousands of times."
Thanks!
A fairly literal implementation of the question is to lapply() along the player ids, returning the unique elements of the head of the ids:
f0 <- function(player_ids)
    lapply(seq_along(player_ids), function(i) unique(head(player_ids, i)))
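As a quick illustration (not from the original answer), f0() on the question's sample data reproduces the desired result:

player_ids <- c(34, 87, 27, 34, 87, 9, 29, 25, 24, 25, 34, 37)  # sample data from the question
f0(player_ids)[[4]]
## [1] 34 87 27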
This avoids the need to manage the allocation of the result list, and also handles the case when length(player_ids) == 0L. For a more efficient implementation, create the list of 'cumulative' sets
uid <- unique(player_ids)
sets <- lapply(seq_along(uid), function(i) uid[seq_len(i)])
then identify the set belonging to the i-th index:
did <- !duplicated(player_ids)
sets[cumsum(did)]
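On the question's sample data this gives (an illustrative aside, not part of the original answer):

cumsum(did)
##  [1] 1 2 3 3 3 4 5 6 7 7 7 8
## e.g. position 5 maps to sets[[3]], which is c(34, 87, 27)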
Here are some of the solutions so far:
f1 <- function(player_ids) {
    end = length(player_ids)
    tank <- player_ids[1]
    unique_players_list = vector("list", end)
    for(i in 1:end) {
        if (!player_ids[i] %in% tank) tank <- c(tank, player_ids[i])
        unique_players_list[[i]] = tank
    }
    unique_players_list
}
f2 <- function(player_ids) {
    un = unique(player_ids)
    ma = match(un, player_ids)
    li = vector("list", length(player_ids))
    for (i in seq_along(player_ids))
        li[[i]] = un[ma <= i]
    li
}
f3 <- function(player_ids) {
    uid <- unique(player_ids)
    sets <- lapply(seq_along(uid), function(i) uid[seq_len(i)])
    sets[cumsum(!duplicated(player_ids))]
}
Some basic tests that they generate reasonable results:
> identical(f1(player_ids), f2(player_ids))
[1] TRUE
> identical(f1(player_ids), f3(player_ids))
[1] TRUE
and an assessment of performance for a larger data set:
> library(microbenchmark)
> ids <- sample(100, 10000, TRUE)
> microbenchmark(f1(ids), f2(ids), f3(ids), times=10)
Unit: microseconds
    expr       min        lq       mean     median        uq       max neval
 f1(ids) 24397.193 25820.375 32055.5720 26475.8245 28030.866 56487.781    10
 f2(ids) 20607.564 22148.888 34462.5850 24432.4785 51722.208 53473.468    10
 f3(ids)   414.649   458.271   772.3738   501.5185   686.383  2163.261    10
f3() does well when the vector of initial values is large compared to the number of unique values. Here's a data set where the elements in the original vector are mostly unique, and the timings are more comparable:
> ids <- sample(1000000, 10000, TRUE)
> microbenchmark(f1(ids), f2(ids), f3(ids), times=10)
Unit: milliseconds
    expr      min       lq     mean   median       uq      max neval
 f1(ids) 214.2505 232.3902 233.7632 233.4617 237.5509 249.4652    10
 f2(ids) 433.5181 443.5987 512.4475 463.8388 467.3710 949.4882    10
 f3(ids) 299.2291 301.4931 307.7576 302.9375 316.6055 321.3942    10
It can be important to get the edge cases correct, a common problem being a zero-length vector, e.g., f2(integer()). f1() does not handle this case. Interestingly, I think all implementations are agnostic to the type of input, e.g., f1(sample(letters, 100, TRUE)) works.
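For instance (an illustrative check, not part of the original benchmarks; the exact error text may vary by R version):

f2(integer())        # list() -- a sensible empty result
f3(integer())        # list()
try(f1(integer()))   # fails: the loop runs over 1:0 and `if()` sees a
                     #   zero-length condition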
Some off-line discussion led to the suggestion that the return format is neither convenient nor memory efficient, and that duplicated() and unique() are similar enough operations that we should be able to get away with a single call. This led to the following solution, which returns a list of the unique identifiers and the offsets, for each player_id, to the end of the unique identifiers:
f5 <- function(player_ids) {
    did <- !duplicated(player_ids)
    list(uid = player_ids[did], end_idx = cumsum(did))
}
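To recover the unique ids seen up to position i from this compact form, take the first end_idx[i] elements of uid (a sketch, not part of the original answer):

res <- f5(player_ids)
i <- 5
res$uid[seq_len(res$end_idx[i])]
## [1] 34 87 27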
The results are not directly comparable with identical() or similar. An updated f3() is
f3a <- function(player_ids) {
    did <- !duplicated(player_ids)
    uid <- player_ids[did]
    sets <- lapply(seq_along(uid), function(i) uid[seq_len(i)])
    sets[cumsum(did)]
}
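A quick sanity check (not in the original post) confirms the rewrite is equivalent to f3():

identical(f3(player_ids), f3a(player_ids))
## [1] TRUE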
Here are a couple of performance metrics:
> ids <- sample(100, 10000, TRUE)
> print(object.size(f3(ids)), units="auto")
4.2 Mb
> print(object.size(f5(ids)), units="auto")
39.8 Kb
> microbenchmark(f3(ids), f3a(ids), f5(ids), times=10)
Unit: microseconds
     expr     min      lq     mean   median      uq     max neval
  f3(ids) 437.663 445.091 450.3965 447.3755 452.629 476.016    10
 f3a(ids) 342.378 351.408 385.0844 354.2375 369.861 638.084    10
  f5(ids) 125.956 127.684 129.9898 128.5890 130.202 140.521    10
and
> ids <- sample(1000000, 10000, TRUE)
> microbenchmark(f3(ids), f3a(ids), f5(ids), times=10)
Unit: microseconds
     expr        min         lq         mean     median          uq         max neval
  f3(ids) 816317.361 821892.902  911862.5561 831274.596 1107496.984 1112586.295    10
 f3a(ids) 824593.618 827590.130 1009032.9519 829197.863  838559.619 2607916.641    10
  f5(ids)    213.677    270.397     313.1614    282.213     315.683     601.724    10
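Since the question mentioned a data frame would actually be preferred, a hypothetical helper (not part of the original answer; note it re-inflates the result, so the memory advantage of f5() is lost) could expand the compact representation into a long format:

## Hypothetical helper: pair each position with every unique id seen so far.
as_long_df <- function(res) {
    data.frame(
        position = rep(seq_along(res$end_idx), res$end_idx),
        uid = res$uid[sequence(res$end_idx)]
    )
}
head(as_long_df(f5(player_ids)))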