I've got a relatively simple problem (I think) and I want to solve it in a fast and efficient way.
I want to count the number of different elements in a vector up to each point in this vector.
For example, in a vector like this
vec <- c("a", "b", "c", "a", "a", "c", "d", "a")
I want to get the following vector of equal size as a result:
[1] 1 2 3 3 3 3 4 4
I could solve this of course with a for loop in combination with cumsum():
vec <- c("a", "b", "c", "a", "a", "c", "d", "a")
res <- TRUE  # the first element is always a new type
for (i in 2:length(vec)) {
  res[i] <- !(vec[i] %in% vec[1:(i - 1)])
}
cumsum(res)
[1] 1 2 3 3 3 3 4 4
However, I am dealing with vectors that have several million elements and a for-loop approach takes forever for such a relatively simple problem.
I have the intuition that this should be solvable much faster and more cleverly. Do you have any ideas? Thank you!
(In case you're interested: I need this for a vocabulary growth curve analysis where we want to know at each point in the text how many different words, i.e. types, have been observed so far.)
Use cumsum() on the negated (!) duplicated() values:
cumsum(!duplicated(vec))
#[1] 1 2 3 3 3 3 4 4
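To see why this works: duplicated() flags every element that has already appeared earlier in the vector, so negating it marks first occurrences, and their cumulative sum is the running type count. A quick sketch of the intermediate steps on the example vector:

```r
vec <- c("a", "b", "c", "a", "a", "c", "d", "a")

# duplicated() marks repeats of values seen earlier
duplicated(vec)
#> [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE

# negating it marks first occurrences; cumsum() counts them
cumsum(!duplicated(vec))
#> [1] 1 2 3 3 3 3 4 4
```

Both duplicated() and cumsum() are vectorized internal functions, which is why this scales well to vectors with millions of elements.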
And another approach with match():
uni <- vector(length = length(vec))
uni[match(unique(vec), vec)] <- TRUE
cumsum(uni)
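For clarity, here is what the intermediate steps of the match() approach look like on the same example vector: match(unique(vec), vec) returns the position of each value's first occurrence, and those positions are the only entries set to TRUE before the cumulative sum.

```r
vec <- c("a", "b", "c", "a", "a", "c", "d", "a")

unique(vec)
#> [1] "a" "b" "c" "d"

# positions of the first occurrence of each unique value
match(unique(vec), vec)
#> [1] 1 2 3 7

# logical vector, TRUE only at first occurrences
uni <- vector(length = length(vec))
uni[match(unique(vec), vec)] <- TRUE
cumsum(uni)
#> [1] 1 2 3 3 3 3 4 4
```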
I believe the duplicated() approach by @Maël is the most efficient, cheers!
An alternative could be using cummax() + factor() (or match()), but it would be slower than cumsum() + duplicated():
> cummax(as.integer(factor(vec, levels = unique(vec))))
[1] 1 2 3 3 3 3 4 4
> cummax(match(vec, unique(vec)))
[1] 1 2 3 3 3 3 4 4
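The idea behind both variants: match(vec, unique(vec)) recodes each element as the index of its first occurrence, so the running maximum of those integer codes equals the number of distinct types seen so far. A sketch of the intermediate codes:

```r
vec <- c("a", "b", "c", "a", "a", "c", "d", "a")

# recode each element by the position of its first occurrence
codes <- match(vec, unique(vec))
codes
#> [1] 1 2 3 1 1 3 4 1

# the running maximum of the codes is the running type count
cummax(codes)
#> [1] 1 2 3 3 3 3 4 4
```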
and benchmarking
set.seed(0)
vec <- sample(c(letters, LETTERS), 1e6, TRUE)
library(microbenchmark)
microbenchmark(
cumsum = cumsum(!duplicated(vec)),
cummax1 = cummax(as.integer(factor(vec, levels = unique(vec)))),
cummax2 = cummax(match(vec, unique(vec))),
check = "equivalent",
unit = "relative"
)
shows
Unit: relative
expr min lq mean median uq max neval
cumsum 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100
cummax1 1.945930 1.879805 1.577003 1.684878 1.544305 0.7344597 100
cummax2 1.974178 1.870829 1.611623 1.713824 1.510238 1.3714560 100
Here is another way, not efficient, but it lets you keep track of the cumulative set of types as it grows:
> Reduce(union, vec, accumulate = TRUE)
[[1]]
[1] "a"
[[2]]
[1] "a" "b"
[[3]]
[1] "a" "b" "c"
[[4]]
[1] "a" "b" "c"
[[5]]
[1] "a" "b" "c"
[[6]]
[1] "a" "b" "c"
[[7]]
[1] "a" "b" "c" "d"
[[8]]
[1] "a" "b" "c" "d"
and taking the lengths gives the growth curve:
> lengths(Reduce(union, vec, accumulate = TRUE))
[1] 1 2 3 3 3 3 4 4