Vectorizing a for loop in R that uses the unique function

player_ids = c(34, 87, 27, 34, 87, 9, 29, 25, 24, 25, 34, 37)
end = length(player_ids)
unique_players_list = list()

for(i in 1:end) {
  unique_players_list[[i]] = unique(player_ids[1:i])
}

This is (a shortened version of) the for loop I am trying to vectorize. I'm not sure how to post code output, but the list unique_players_list should have the following contents:

unique_players_list[[1]] == c(34)
unique_players_list[[2]] == c(34, 87)
unique_players_list[[3]] == c(34, 87, 27)
unique_players_list[[4]] == c(34, 87, 27)
unique_players_list[[5]] == c(34, 87, 27)

And so on. The output does not have to be in a list (I would actually prefer a data frame), but I need this vectorized because my current for loop takes forever and I need to run this code tens of thousands of times.

Thanks!

Asked Aug 01 '16 by Canovice

1 Answer

A fairly literal implementation of the question is to lapply() along the player ids, returning the unique elements of the head of the ids:

f0 <- function(player_ids)
    lapply(seq_along(player_ids), function(i) unique(head(player_ids, i)))
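To illustrate, here is f0 applied to the player_ids vector from the question (the function is reproduced so the snippet stands alone):

```r
# player_ids from the question
player_ids <- c(34, 87, 27, 34, 87, 9, 29, 25, 24, 25, 34, 37)

f0 <- function(player_ids)
    lapply(seq_along(player_ids), function(i) unique(head(player_ids, i)))

res <- f0(player_ids)
res[[5]]  # unique ids among the first five values: c(34, 87, 27)
```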

This avoids the need to manage the allocation of the result list, and also handles the case when length(player_ids) == 0L. For a more efficient implementation, create the list of 'cumulative' sets

uid <- unique(player_ids)
sets <- lapply(seq_along(uid), function(i) uid[seq_len(i)])

then identify the set belonging to the i-th index:

did <- !duplicated(player_ids)
sets[cumsum(did)]
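Tracing these two steps on a short prefix of the question's data shows how cumsum(did) maps each position to its set:

```r
player_ids <- c(34, 87, 27, 34, 87, 9)

uid <- unique(player_ids)                 # 34 87 27 9
sets <- lapply(seq_along(uid), function(i) uid[seq_len(i)])

did <- !duplicated(player_ids)            # TRUE TRUE TRUE FALSE FALSE TRUE
idx <- cumsum(did)                        # 1 2 3 3 3 4: which set each position uses
result <- sets[idx]
result[[5]]                               # c(34, 87, 27) -- 34 and 87 are repeats
```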

Here are some of the solutions so far:

f1 <- function(player_ids) {
    end = length(player_ids)
    tank <- player_ids[1]

    unique_players_list = vector("list", end)
    for(i in 1:end) {
        if (!player_ids[i] %in% tank) tank <- c(tank, player_ids[i])
        unique_players_list[[i]] = tank
    }
    unique_players_list
}

f2 <- function(player_ids) {
    un = unique(player_ids)
    ma = match(un, player_ids)
    li = vector("list", length(player_ids))

    for (i in seq_along(player_ids))
        li[[i]] = un[ma <= i]
    li
}

f3 <- function(player_ids) {
    uid <- unique(player_ids)
    sets <- lapply(seq_along(uid), function(i) uid[seq_len(i)])
    sets[cumsum(!duplicated(player_ids))]
}

Some basic tests that they are generating reasonable results:

> identical(f1(player_ids), f2(player_ids))
[1] TRUE
> identical(f1(player_ids), f3(player_ids))
[1] TRUE

and an assessment of performance for a larger data set:

> library(microbenchmark)
> ids <- sample(100, 10000, TRUE)
> microbenchmark(f1(ids), f2(ids), f3(ids), times=10)
Unit: microseconds
    expr       min        lq       mean     median        uq       max neval
 f1(ids) 24397.193 25820.375 32055.5720 26475.8245 28030.866 56487.781    10
 f2(ids) 20607.564 22148.888 34462.5850 24432.4785 51722.208 53473.468    10
 f3(ids)   414.649   458.271   772.3738   501.5185   686.383  2163.261    10

f3() does well when the input vector is large compared to the number of unique values. Here's a data set where the elements of the original vector are mostly unique, and the timings are more comparable:

> ids <- sample(1000000, 10000, TRUE)
> microbenchmark(f1(ids), f2(ids), f3(ids), times=10)
Unit: milliseconds
    expr      min       lq     mean   median       uq      max neval
 f1(ids) 214.2505 232.3902 233.7632 233.4617 237.5509 249.4652    10
 f2(ids) 433.5181 443.5987 512.4475 463.8388 467.3710 949.4882    10
 f3(ids) 299.2291 301.4931 307.7576 302.9375 316.6055 321.3942    10

It can be important to get the edge cases correct, a common problem being a zero-length vector, e.g., f2(integer()). f1() does not handle this case. Interestingly, I think all implementations are agnostic to the type of input, e.g., f1(sample(letters, 100, TRUE)) works.
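A quick sketch of those edge cases with f3() (the function is reproduced here so the snippet stands alone):

```r
f3 <- function(player_ids) {
    uid <- unique(player_ids)
    sets <- lapply(seq_along(uid), function(i) uid[seq_len(i)])
    sets[cumsum(!duplicated(player_ids))]
}

f3(integer())         # list() -- zero-length input handled gracefully
f3(c("a", "b", "a"))  # type-agnostic: works for character input, too
```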

Some off-line discussion led to the suggestion that the return format is neither convenient nor memory efficient, and that since duplicated() and unique() are similar operations we should be able to get away with a single call. This led to the following solution, which returns the unique identifiers together with, for each player_id, the offset to the end of the unique identifiers seen so far:

f5 <- function(player_ids) {
    did <- !duplicated(player_ids)
    list(uid = player_ids[did], end_idx = cumsum(did))
}
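Recovering the set for any index from f5()'s compact return is a single subset. The helper nth_set() below is a hypothetical name, not part of the answer:

```r
f5 <- function(player_ids) {
    did <- !duplicated(player_ids)
    list(uid = player_ids[did], end_idx = cumsum(did))
}

# nth_set() (hypothetical helper): the unique ids seen through position i
nth_set <- function(res, i) res$uid[seq_len(res$end_idx[i])]

player_ids <- c(34, 87, 27, 34, 87, 9)
res <- f5(player_ids)
nth_set(res, 5)  # c(34, 87, 27)
```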

The results are not directly comparable with identical() or similar. An updated f3() is:

f3a <- function(player_ids) {
    did <- !duplicated(player_ids)
    uid <- player_ids[did]
    sets <- lapply(seq_along(uid), function(i) uid[seq_len(i)])
    sets[cumsum(did)]
}

Here are a couple of metrics of performance:

> ids <- sample(100, 10000, TRUE)
> print(object.size(f3(ids)), units="auto")
4.2 Mb
> print(object.size(f5(ids)), units="auto")
39.8 Kb
> microbenchmark(f3(ids), f3a(ids), f5(ids), times=10)
Unit: microseconds
    expr     min      lq     mean   median      uq     max neval
 f3(ids) 437.663 445.091 450.3965 447.3755 452.629 476.016    10
f3a(ids) 342.378 351.408 385.0844 354.2375 369.861 638.084    10
 f5(ids) 125.956 127.684 129.9898 128.5890 130.202 140.521    10

and

> ids <- sample(1000000, 10000, TRUE)
> microbenchmark(f3(ids), f3a(ids), f5(ids), times=10)
Unit: microseconds
     expr        min         lq         mean     median          uq         max neval
  f3(ids) 816317.361 821892.902  911862.5561 831274.596 1107496.984 1112586.295    10
 f3a(ids) 824593.618 827590.130 1009032.9519 829197.863  838559.619 2607916.641    10
  f5(ids)    213.677    270.397     313.1614    282.213     315.683     601.724    10
Answered Sep 19 '22 by Martin Morgan