Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Number of different elements up to this point

Tags:

r

unique

vector

I've got a relatively simple problem (I think) and I want to solve it in a fast and efficient way.

I want to count the number of different elements in a vector up to each point in this vector.

For example, in a vector like this

vec <- c("a", "b", "c", "a", "a", "c", "d", "a")

I want to get the following vector of equal size as a result: [1 2 3 3 3 3 4 4]

I could solve this of course with a for loop in combination with cumsum():

vec <- c("a", "b", "c", "a", "a", "c", "d", "a")
res <- T
for (i in 2:length(vec)) {
  res[i] <- !(vec[i] %in% vec[1:(i-1)])
}
cumsum(res)
[1] 1 2 3 3 3 3 4 4

However, I am dealing with vectors that have several million elements and a for-loop approach takes forever for such a relatively simple problem.

I have the intuition that this should be solvable much faster and more clever. Do you have any ideas? Thank you!

(In case you're interested: I need this for a vocabulary growth curve analysis where we want to know at each point in the text how many different words, i.e. types, have been observed so far.)

like image 937
swolf Avatar asked Oct 12 '25 16:10

swolf


2 Answers

Use cumsum on the non (!) duplicated values:

cumsum(!duplicated(vec))
#[1] 1 2 3 3 3 3 4 4

And another approach with match:

uni <- vector(length = length(vec))
uni[match(unique(vec), vec)] <- TRUE
cumsum(uni)
like image 182
Maël Avatar answered Oct 14 '25 04:10

Maël


I believe the duplicated approach by @Maël is the most efficient, cheers!

An alternative could be using cummax + factor (or match), e.g., but it would be slower than cumsum + duplicated

> cummax(as.integer(factor(vec, levels = unique(vec))))
[1] 1 2 3 3 3 3 4 4

> cummax(match(vec, unique(vec)))
[1] 1 2 3 3 3 3 4 4

and benchmarking

set.seed(0)
vec <- sample(c(letters, LETTERS), 1e6, TRUE)
microbenchmark(
    cumsum = cumsum(!duplicated(vec)),
    cummax1 = cummax(as.integer(factor(vec, levels = unique(vec)))),
    cummax2 = cummax(match(vec, unique(vec))),
    check = "equivalent",
    unit = "relative"
)

shows

Unit: relative
    expr      min       lq     mean   median       uq       max neval
  cumsum 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100
 cummax1 1.945930 1.879805 1.577003 1.684878 1.544305 0.7344597   100
 cummax2 1.974178 1.870829 1.611623 1.713824 1.510238 1.3714560   100

Here is another way, not efficient, but you can keep track of the cumulative updating

> Reduce(union, vec, accumulate = TRUE)
[[1]]
[1] "a"

[[2]]
[1] "a" "b"

[[3]]
[1] "a" "b" "c"

[[4]]
[1] "a" "b" "c"

[[5]]
[1] "a" "b" "c"

[[6]]
[1] "a" "b" "c"

[[7]]
[1] "a" "b" "c" "d"

[[8]]
[1] "a" "b" "c" "d"

and the length growth

> lengths(Reduce(union, vec, accumulate = TRUE))
[1] 1 2 3 3 3 3 4 4
like image 44
ThomasIsCoding Avatar answered Oct 14 '25 04:10

ThomasIsCoding