I have a list containing a few million sublists, and these sublists take only a few distinct values, maybe 10 to 100.
I want to count the number of occurrences of these values.
The code below works, but it is very slow. Can we do this faster?
count_by_list <- function(lst, var_nm = as.character(substitute(lst)), count_nm = "n"){
  unique_lst <- unique(lst)
  res <- tibble::tibble(!!var_nm := unique_lst, !!count_nm := NA)
  for (i in seq_along(unique_lst)) {
    res[[count_nm]][[i]] <- sum(lst %in% res[[var_nm]][i])
  }
  res
}
x <- list(
  list(a = 1, b = 2),
  list(a = 1, b = 2),
  list(b = 3),
  list(b = 3, c = 4)
)
count_by_list(x)
#> # A tibble: 3 x 2
#>   x                    n
#>   <list>           <int>
#> 1 <named list [2]>     2
#> 2 <named list [1]>     1
#> 3 <named list [2]>     1
Created on 2019-11-29 by the reprex package (v0.3.0)
I tried hashing with the digest library, but it was actually slower, and it got worse as n increased:
library(digest)

count_by_list2 <- function(lst, var_nm = as.character(substitute(lst)), count_nm = "n"){
  unique_lst <- unique(lst)
  digested   <- vapply(lst, digest, character(1))
  res <- as.data.frame(table(digested))
  names(res) <- c(var_nm, count_nm)
  # table() sorts by digest string, so realign unique_lst with the sorted digests
  res[[1]] <- unique_lst[match(res[[1]], vapply(unique_lst, digest, character(1)))]
  res
}
If you need to benchmark, you can use x_big <- unlist(replicate(10000, x, FALSE), recursive = FALSE).
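For reference, a harness along the following lines can compare the two versions above (a sketch assuming the bench package, which also produces tables like the one in the answers below; check = FALSE because the two functions return differently shaped results):

library(bench)

x_big <- unlist(replicate(10000, x, FALSE), recursive = FALSE)

# check = FALSE: the candidates return the same counts in different shapes
bench::mark(
  count_by_list(x_big),
  count_by_list2(x_big),
  check = FALSE
)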
I added the tags rcpp and parallel-processing as these might help, but they are not constraints on the answers.
This data.table approach is 30 times faster than the OP's original loop on the x_big example. One notable precaution: if any element of a sublist has length greater than one, rbindlist() will spread it over multiple rows and this approach will fail.
library(data.table)
library(tibble)

molten_lst <- rbindlist(x, fill = TRUE)
cnt_lst <- molten_lst[, .N, by = names(molten_lst)]
tibble(x = cnt_lst[,
                   list(apply(.SD, 1, function(row) as.list(na.omit(row)))),
                   .SDcols = names(molten_lst),
                   by = .(seq_len(nrow(cnt_lst)))]$V1,
       n = cnt_lst[['N']])
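For benchmarking, the same pipeline can be wrapped into a function, which is presumably what the molten_dt entry in the performance table below refers to (a sketch keeping the logic above unchanged):

molten_dt <- function(x) {
  # melt the sublists into a wide table, one row per sublist
  molten_lst <- rbindlist(x, fill = TRUE)
  # count identical rows, then rebuild each distinct sublist by dropping NAs
  cnt_lst <- molten_lst[, .N, by = names(molten_lst)]
  tibble(x = cnt_lst[,
                     list(apply(.SD, 1, function(row) as.list(na.omit(row)))),
                     .SDcols = names(molten_lst),
                     by = .(seq_len(nrow(cnt_lst)))]$V1,
         n = cnt_lst[['N']])
}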
Here are two backup approaches. I ran into NSE / quasiquotation issues, so the !!var_nm was simplified. The first approach tweaks your original function, primarily by filtering lst down as the loop progresses.
enhanced_loop <- function(lst, var_nm = as.character(substitute(lst)), count_nm = "n"){
  unique_lst <- unique(lst)
  cnts <- vector('integer', length(unique_lst))
  # count and drop each value as we go, so later passes scan a shorter list
  for (i in seq_along(unique_lst)[-length(unique_lst)]){
    ind <- lst %in% unique_lst[i]
    lst <- lst[!ind]
    cnts[i] <- sum(ind)
  }
  # whatever remains must all be the last unique value
  cnts[length(unique_lst)] <- length(lst)
  tibble::tibble(x := unique_lst, !!count_nm := cnts)
}
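A quick sanity check on the small example; the counts should come back identical to the original function's:

identical(count_by_list(x)$n, enhanced_loop(x)$n)
#> [1] TRUE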
And this takes the loop to its logical conclusion, using match() instead of %in% so that effort is not duplicated:
tabulate_match <- function(lst, var_nm = as.character(substitute(lst)), count_nm = "n"){
  unique_lst <- unique(lst)
  cnts <- tabulate(match(lst, unique_lst))
  tibble::tibble(x := unique_lst, !!count_nm := cnts)
}
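To see why this avoids duplicated effort: match() maps every sublist to the index of its first occurrence among the uniques in a single pass, and tabulate() then counts those indices, instead of rescanning the full list once per unique value:

match(x, unique(x))
#> [1] 1 1 2 3
tabulate(match(x, unique(x)))
#> [1] 2 1 1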
Performance:
# A tibble: 7 x 13
  expression                 min  median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr>               <bch> <bch:t>     <dbl> <bch:byt>    <dbl> <int>
1 molten_dt                 25ms  25.1ms     39.7     2.71MB     0        5
2 tabulate_match(x_big)    237ms 247.2ms      3.41    1.42MB     2.05     5
3 enhanced_loop(x_big)     344ms 352.6ms      2.82    2.83MB     1.69     5
4 table_sapply             381ms 384.9ms      2.59    3.76MB     7.77     5
5 vapply_tab_match(x_big)  412ms 429.3ms      2.14    4.21MB     3.85     5
6 dt_thing(x_big)          442ms 464.6ms      2.15    2.83MB     7.31     5
7 count_by_list(x_big)     759ms 768.4ms      1.24     3.4MB     2.23     5
Here is something quick and dirty that shaves some time off the original solution.
library(tibble)

cbl2 <- function(x) {
  # deparse each sublist into a single string key
  xcv <- vapply(seq_along(x), function(i) paste(x[i]), character(1))
  # count keys by the index of their first occurrence
  xcv_count <- table(match(xcv, xcv))
  tibble(x = x[as.integer(names(xcv_count))], n = as.vector(xcv_count))
}
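The trick is that paste() deparses each single-element subset of the list into one string, which then acts as a cheap grouping key (the exact string is just whatever as.character() produces for a list):

paste(x[1])
#> [1] "list(a = 1, b = 2)"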
Some playing around with data.table again shortened the run time:
library(data.table)

cbl3 <- function(x) {
  data.table(xlist = x)[, xstring := paste(xlist), by = seq_along(x)
                        ][, .(x = xlist[1], .N), by = xstring
                        ][, .(x, n = N)
                        ][, as_tibble(.SD)]
}
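As a sanity check, the counts from cbl3() should match the original function on the example data:

all.equal(cbl3(x)$n, count_by_list(x)$n)
#> [1] TRUE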