Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Count occurences of lists efficiently

I have a list containing a few millions of lists, these sublists have a few distinct possible values, maybe 10 to 100.

I want to count the number of occurrences of these values.

The code below works but it is very slow. Can we do this faster ?

count_by_list <- function(lst, var_nm = as.character(substitute(lst)), count_nm = "n"){
  unique_lst <- unique(lst)
  res <- tibble::tibble(!!var_nm := unique_lst, !!count_nm := NA)
  for(i in seq_along(unique_lst)){
    res[[count_nm]][[i]] <- sum(lst %in% res[[var_nm]][i])
  }
  res
}

x <- list(
  list(a=1, b=2),
  list(a=1, b=2),
  list(b=3),
  list(b=3, c=4))

count_by_list(x)
#> # A tibble: 3 x 2
#>   x                    n
#>   <list>           <int>
#> 1 <named list [2]>     2
#> 2 <named list [1]>     1
#> 3 <named list [2]>     1

Created on 2019-11-29 by the reprex package (v0.3.0)

I tried hashing with the library digest but it was actually slower, and getting worse as n increases :

library(digest)
count_by_list2 <- function(lst, var_nm = as.character(substitute(lst)), count_nm = "n"){
  unique_lst <- unique(lst)
  digested   <- vapply(lst, digest, character(1))
  res        <- as.data.frame(table(digested))
  names(res) <- c(var_nm, count_nm)
  res[[1]] <- unique_lst
  res
}

If you need to benchmark you can use x_big <- unlist(replicate(10000 ,x, F), recursive = FALSE).

I added the tags rcpp and parallel processing as these might help, these are not constraints on the answers.

like image 718
Moody_Mudskipper Avatar asked Nov 29 '19 15:11

Moody_Mudskipper


2 Answers

This data.table approach is 30 times faster than OP original loop for the x_big example. One notable precaution is that if any element of a sublist contains more than one record, this approach would fail.

library(data.table)

molten_lst <- rbindlist(x, fill = T)
cnt_lst <- molten_lst[, .N, names(molten_lst)]

tibble(x = cnt_lst[, 
                   list(apply(.SD, 1, function(x) as.list(na.omit(x)))),
                   .SDcols = names(molten_lst),
                   by = .(seq_len(nrow(cnt_lst)))]$V1,
           n = cnt_lst[['N']])

Here are two backup approaches. I ran into NSE / quasi-quotation issues, so the !!var_nam was simplified. The first approach is some tweaks to your original function - primarily by filtering the lst during the loop.

enhanced_loop <- function(lst, var_nm = as.character(substitute(lst)), count_nm = "n"){
  unique_lst <- unique(lst)
  cnts <- vector('integer', length(unique_lst))

  for (i in seq_along(unique_lst)[-length(unique_lst)]){
    ind <- lst %in% unique_lst[i]
    lst <- lst[!ind]
    cnts[i] <- sum(ind)
  }
  cnts[length(unique_lst)] <- length(lst)
  tibble::tibble(x := unique_lst, !!count_nm := cnts)
}

And this takes the loop to the logical conclusion - using match() instead of %in% so effort is not duplicated:

tabulate_match <- function(lst, var_nm = as.character(substitute(lst)), count_nm = "n"){
  unique_lst <- unique(lst)
  cnts <- tabulate(match(lst, unique_lst))
  tibble::tibble(x := unique_lst, !!count_nm := cnts)
}

Performance:

# A tibble: 7 x 13
  expression                min  median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr>              <bch> <bch:t>     <dbl> <bch:byt>    <dbl> <int>
1 molten_dt                25ms  25.1ms     39.7     2.71MB     0        5
2 tabulate_match(x_big)   237ms 247.2ms      3.41    1.42MB     2.05     5
3 enhanced_loop(x_big)    344ms 352.6ms      2.82    2.83MB     1.69     5
4 table_sapply            381ms 384.9ms      2.59    3.76MB     7.77     5
5 vapply_tab_match(x_big) 412ms 429.3ms      2.14    4.21MB     3.85     5
6 dt_thing(x_big)         442ms 464.6ms      2.15    2.83MB     7.31     5
7 count_by_list(x_big)    759ms 768.4ms      1.24     3.4MB     2.23     5
like image 76
Cole Avatar answered Sep 24 '22 22:09

Cole


Here is something quick and dirty that shaves off the original solution.

cbl2 <- function(x) {
  xcv <- vapply(seq_along(x), function(i) paste(x[i]), character(1))
  xcv_count <- table(match(xcv, xcv))
  tibble(x = x[as.integer(names(xcv_count))], n = as.vector(xcv_count))  
}

Some playing around with data.table again shortened the run time:

cbl3 <- function(x) {
  data.table(xlist = x)[, xstring := paste(xlist), by = 1:length(x)
                        ][, .(x = xlist[1], .N), by = xstring
                          ][, .(x, n = N)
                            ][, as_tibble(.SD)]
}
like image 39
sindri_baldur Avatar answered Sep 23 '22 22:09

sindri_baldur