Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Count number of occurrences of vector in list

Tags:

r

I have a list of vectors of variable length, for example:

q <- list(c(1,3,5), c(2,4), c(1,3,5), c(2,5), c(7), c(2,5))

I need to count the number of occurrences for each of the vectors in the list, for example (any other suitable datastructure acceptable):

list(list(c(1,3,5), 2), list(c(2,4), 1), list(c(2,5), 2), list(c(7), 1))

Is there an efficient way to do this? The actual list has tens of thousands of items so quadratic behaviour is not feasible.

like image 747
Jan Šimbera Avatar asked Sep 07 '16 14:09

Jan Šimbera


People also ask

How do you find the number of occurrences in a list?

Using the count() Function The "standard" way (no external libraries) to get the count of word occurrences in a list is by using the list object's count() function. The count() method is a built-in function that takes an element as its only argument and returns the number of times that element appears in the list.

How do you count occurrences in a list in Python?

The easiest way to count the number of occurrences in a Python list of a given item is to use the Python . count() method. The method is applied to a given list and takes a single argument. The argument passed into the method is counted and the number of occurrences of that item in the list is returned.

How do you count the number of repeated elements in a list in Python?

Operator. countOf() is used for counting the number of occurrences of b in a. It counts the number of occurrences of value. It returns the Count of a number of occurrences of value.


2 Answers

match and unique accept and handle "list"s too (?match warns for being slow on "list"s). So, with:

match(q, unique(q))
#[1] 1 2 1 3 4 3

each element is mapped to a single integer. Then:

tabulate(match(q, unique(q)))
#[1] 2 1 2 1

And find a structure to present the results:

as.data.frame(cbind(vec = unique(q), n = tabulate(match(q, unique(q)))))
#      vec n
#1 1, 3, 5 2
#2    2, 4 1
#3    2, 5 2
#4       7 1

Alternatively to match(x, unique(x)) approach, we could map each element to a single value with deparseing:

table(sapply(q, deparse))
#
#         7 c(1, 3, 5)    c(2, 4)    c(2, 5) 
#         1          2          1          2

Also, since this is a case with unique integers, and assuming in a small range, we could map each element to a single integer after transforming each element to a binary representation:

n = max(unlist(q))
pow2 = 2 ^ (0:(n - 1))
sapply(q, function(x) tabulate(x, nbins = n))  # 'binary' form
sapply(q, function(x) sum(tabulate(x, nbins = n) * pow2))
#[1] 21 10 21 18 64 18

and then tabulate as before.

And just to compare the above alternatives:

f1 = function(x) 
{
    ux = unique(x)
    i = match(x, ux)
    cbind(vec = ux, n = tabulate(i))
}   

f2 = function(x)
{
    xc = sapply(x, deparse)
    i = match(xc, unique(xc))
    cbind(vec = x[!duplicated(i)], n = tabulate(i))
}  

f3 = function(x)
{
    n = max(unlist(x))
    pow2 = 2 ^ (0:(n - 1))
    v = sapply(x, function(X) sum(tabulate(X, nbins = n) * pow2))
    i = match(v, unique(v))
    cbind(vec = x[!duplicated(v)], n = tabulate(i))
}

q2 = rep_len(q, 1e3)

all.equal(f1(q2), f2(q2))
#[1] TRUE
all.equal(f2(q2), f3(q2))
#[1] TRUE

microbenchmark::microbenchmark(f1(q2), f2(q2), f3(q2))
#Unit: milliseconds
#   expr       min        lq      mean    median        uq       max neval cld
# f1(q2)  7.980041  8.161524 10.525946  8.291678  8.848133 178.96333   100  b 
# f2(q2) 24.407143 24.964991 27.311056 25.514834 27.538643  45.25388   100   c
# f3(q2)  3.951567  4.127482  4.688778  4.261985  4.518463  10.25980   100 a 

Another interesting alternative is based on ordering. R > 3.3.0 has a grouping function, built off data.table, which, along with the ordering, provides some attributes for further manipulation:

Make all elements of equal length and "transpose" (probably the most slow operation in this case, though I'm not sure how else to feed grouping):

n = max(lengths(q))
qq = .mapply(c, lapply(q, "[", seq_len(n)), NULL)

Use ordering to group similar elements mapped to integers:

gr = do.call(grouping, qq)
e = attr(gr, "ends") 
i = rep(seq_along(e), c(e[1], diff(e)))[order(gr)]
i
#[1] 1 2 1 3 4 3

then, tabulate as before. To continue the comparisons:

f4 = function(x)
{
    n = max(lengths(x))
    x2 = .mapply(c, lapply(x, "[", seq_len(n)), NULL)
    gr = do.call(grouping, x2)
    e = attr(gr, "ends") 
    i = rep(seq_along(e), c(e[1], diff(e)))[order(gr)]
    cbind(vec = x[!duplicated(i)], n = tabulate(i))
}

all.equal(f3(q2), f4(q2))
#[1] TRUE

microbenchmark::microbenchmark(f1(q2), f2(q2), f3(q2), f4(q2))
#Unit: milliseconds
#   expr       min        lq      mean    median        uq        max neval cld
# f1(q2)  7.956377  8.048250  8.792181  8.131771  8.270101  21.944331   100  b 
# f2(q2) 24.228966 24.618728 28.043548 25.031807 26.188219 195.456203   100   c
# f3(q2)  3.963746  4.103295  4.801138  4.179508  4.360991  35.105431   100 a  
# f4(q2)  2.874151  2.985512  3.219568  3.066248  3.186657   7.763236   100 a

In this comparison q's elements are of small length to accomodate for f3, but f3 (because of large exponentiation) and f4 (because of mapply) will suffer, in performance, if "list"s of larger elements are used.

like image 58
alexis_laz Avatar answered Nov 04 '22 22:11

alexis_laz


One way is to paste each vector , unlist and tabulate, i.e.

table(unlist(lapply(q, paste, collapse = ',')))

#1,3,5   2,4   2,5     7 
#    2     1     2     1 
like image 24
Sotos Avatar answered Nov 04 '22 22:11

Sotos