I have a list of vectors of variable length, for example:
q <- list(c(1,3,5), c(2,4), c(1,3,5), c(2,5), c(7), c(2,5))
I need to count the number of occurrences for each of the vectors in the list, for example (any other suitable datastructure acceptable):
list(list(c(1,3,5), 2), list(c(2,4), 1), list(c(2,5), 2), list(c(7), 1))
Is there an efficient way to do this? The actual list has tens of thousands of items so quadratic behaviour is not feasible.
Using the count() Function The "standard" way (no external libraries) to get the count of word occurrences in a list is by using the list object's count() function. The count() method is a built-in function that takes an element as its only argument and returns the number of times that element appears in the list.
The easiest way to count the number of occurrences in a Python list of a given item is to use the Python . count() method. The method is applied to a given list and takes a single argument. The argument passed into the method is counted and the number of occurrences of that item in the list is returned.
Operator. countOf() is used for counting the number of occurrences of b in a. It counts the number of occurrences of value. It returns the Count of a number of occurrences of value.
match
and unique
accept and handle "list"s too (?match
warns for being slow on "list"s). So, with:
match(q, unique(q))
#[1] 1 2 1 3 4 3
each element is mapped to a single integer. Then:
tabulate(match(q, unique(q)))
#[1] 2 1 2 1
And find a structure to present the results:
as.data.frame(cbind(vec = unique(q), n = tabulate(match(q, unique(q)))))
# vec n
#1 1, 3, 5 2
#2 2, 4 1
#3 2, 5 2
#4 7 1
Alternatively to match(x, unique(x))
approach, we could map each element to a single value with deparse
ing:
table(sapply(q, deparse))
#
# 7 c(1, 3, 5) c(2, 4) c(2, 5)
# 1 2 1 2
Also, since this is a case with unique integers, and assuming in a small range, we could map each element to a single integer after transforming each element to a binary representation:
n = max(unlist(q))
pow2 = 2 ^ (0:(n - 1))
sapply(q, function(x) tabulate(x, nbins = n)) # 'binary' form
sapply(q, function(x) sum(tabulate(x, nbins = n) * pow2))
#[1] 21 10 21 18 64 18
and then tabulate
as before.
And just to compare the above alternatives:
f1 = function(x)
{
ux = unique(x)
i = match(x, ux)
cbind(vec = ux, n = tabulate(i))
}
f2 = function(x)
{
xc = sapply(x, deparse)
i = match(xc, unique(xc))
cbind(vec = x[!duplicated(i)], n = tabulate(i))
}
f3 = function(x)
{
n = max(unlist(x))
pow2 = 2 ^ (0:(n - 1))
v = sapply(x, function(X) sum(tabulate(X, nbins = n) * pow2))
i = match(v, unique(v))
cbind(vec = x[!duplicated(v)], n = tabulate(i))
}
q2 = rep_len(q, 1e3)
all.equal(f1(q2), f2(q2))
#[1] TRUE
all.equal(f2(q2), f3(q2))
#[1] TRUE
microbenchmark::microbenchmark(f1(q2), f2(q2), f3(q2))
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# f1(q2) 7.980041 8.161524 10.525946 8.291678 8.848133 178.96333 100 b
# f2(q2) 24.407143 24.964991 27.311056 25.514834 27.538643 45.25388 100 c
# f3(q2) 3.951567 4.127482 4.688778 4.261985 4.518463 10.25980 100 a
Another interesting alternative is based on ordering. R > 3.3.0 has a grouping
function, built off data.table, which, along with the ordering, provides some attributes for further manipulation:
Make all elements of equal length and "transpose" (probably the most slow operation in this case, though I'm not sure how else to feed grouping
):
n = max(lengths(q))
qq = .mapply(c, lapply(q, "[", seq_len(n)), NULL)
Use ordering to group similar elements mapped to integers:
gr = do.call(grouping, qq)
e = attr(gr, "ends")
i = rep(seq_along(e), c(e[1], diff(e)))[order(gr)]
i
#[1] 1 2 1 3 4 3
then, tabulate as before. To continue the comparisons:
f4 = function(x)
{
n = max(lengths(x))
x2 = .mapply(c, lapply(x, "[", seq_len(n)), NULL)
gr = do.call(grouping, x2)
e = attr(gr, "ends")
i = rep(seq_along(e), c(e[1], diff(e)))[order(gr)]
cbind(vec = x[!duplicated(i)], n = tabulate(i))
}
all.equal(f3(q2), f4(q2))
#[1] TRUE
microbenchmark::microbenchmark(f1(q2), f2(q2), f3(q2), f4(q2))
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# f1(q2) 7.956377 8.048250 8.792181 8.131771 8.270101 21.944331 100 b
# f2(q2) 24.228966 24.618728 28.043548 25.031807 26.188219 195.456203 100 c
# f3(q2) 3.963746 4.103295 4.801138 4.179508 4.360991 35.105431 100 a
# f4(q2) 2.874151 2.985512 3.219568 3.066248 3.186657 7.763236 100 a
In this comparison q
's elements are of small length to accomodate for f3
, but f3
(because of large exponentiation) and f4
(because of mapply
) will suffer, in performance, if "list"s of larger elements are used.
One way is to paste each vector , unlist and tabulate, i.e.
table(unlist(lapply(q, paste, collapse = ',')))
#1,3,5 2,4 2,5 7
# 2 1 2 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With