I have a data frame from several experiments. I want to calculate the cumulative number of unique values obtained after each successive experiment.
For example, consider:
test <- data.frame(exp = c( rep("exp1" , 4) , rep("exp2" , 4), rep("exp3" , 4) , rep("exp4" , 5) ) ,
entries = c("abcd","efgh","ijkl","mnop", "qrst" , "uvwx" , "abcd","efgh","ijkl" , "qrst" , "uvwx",
"yzab" , "yzab" , "cdef" , "mnop" , "uvwx" , "ghij"))
> test
exp entries
1 exp1 abcd
2 exp1 efgh
3 exp1 ijkl
4 exp1 mnop
5 exp2 qrst
6 exp2 uvwx
7 exp2 abcd
8 exp2 efgh
9 exp3 ijkl
10 exp3 qrst
11 exp3 uvwx
12 exp3 yzab
13 exp4 yzab
14 exp4 cdef
15 exp4 mnop
16 exp4 uvwx
17 exp4 ghij
The total number of unique entries is nine. I want the result to look like:
exp cum_unique_entries
1 exp1 4
2 exp2 6
3 exp3 7
4 exp4 9
Finally, I want to plot this as a barplot. I can do this with for loops, but I feel there has to be a more elegant way.
Here's another solution with dplyr:
library(dplyr)
test %>%
  mutate(cum_unique_entries = cumsum(!duplicated(entries))) %>%
  group_by(exp) %>%
  slice(n()) %>%
  select(-entries)
or
test %>%
  mutate(cum_unique_entries = cumsum(!duplicated(entries))) %>%
  group_by(exp) %>%
  summarise(cum_unique_entries = last(cum_unique_entries))
Result:
# A tibble: 4 x 2
exp cum_unique_entries
<fctr> <int>
1 exp1 4
2 exp2 6
3 exp3 7
4 exp4 9
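The question also asks for a barplot, which the snippets above do not cover. A minimal sketch using base R's barplot() follows. The data frame name res is hypothetical, and its values are copied by hand from the result table above rather than computed, so the block runs on its own:

```r
# Cumulative unique counts, copied from the result table above.
# `res` is an illustrative name, not part of the original answer.
res <- data.frame(
  exp = c("exp1", "exp2", "exp3", "exp4"),
  cum_unique_entries = c(4L, 6L, 7L, 9L)
)

# One bar per experiment; barplot() returns the bar midpoints.
mids <- barplot(
  height = res$cum_unique_entries,
  names.arg = res$exp,
  xlab = "Experiment",
  ylab = "Cumulative unique entries"
)
```

In practice the output of either dplyr pipeline could be assigned to a variable and passed in the same way.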
Note:
First find the cumulative sum of all non-duplicates (cumsum(!duplicated(entries))), then group_by exp and take the last cumsum of each group. This number is the cumulative number of unique entries for each group.
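To see the intermediate values, the same steps can be sketched in base R. This is only an illustration under the assumption that rows are already ordered by experiment, as in the example data. The names first_seen, running, and cum_unique are made up for this sketch:

```r
# Example data from the question, as plain vectors.
entries <- c("abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx", "abcd", "efgh",
             "ijkl", "qrst", "uvwx", "yzab", "yzab", "cdef", "mnop", "uvwx",
             "ghij")
exp <- rep(c("exp1", "exp2", "exp3", "exp4"), times = c(4, 4, 4, 5))

# TRUE the first time an entry appears, FALSE for every repeat.
first_seen <- !duplicated(entries)

# Running count of distinct entries seen so far, row by row.
running <- cumsum(first_seen)

# Last running count within each experiment.
cum_unique <- tapply(running, exp, function(x) x[length(x)])
print(cum_unique)
```

Because the running count never decreases, the last value in each group is also that group's maximum.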