If I have a dataframe like this:
neu <- data.frame(test1 = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14),
test2 = c("a","b","a","b","c","c","a","c","c","d","d","f","f","f"))
neu
   test1 test2
1      1     a
2      2     b
3      3     a
4      4     b
5      5     c
6      6     c
7      7     a
8      8     c
9      9     c
10    10     d
11    11     d
12    12     f
13    13     f
14    14     f
and I would like to select only those rows where the level of the factor test2
appears more than, say, three times, what would be the fastest way?
Thanks very much, didn't really find the right answer in the previous questions.
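Find the levels that occur at least three times, either by filtering the table of counts:
z <- table(neu$test2)[table(neu$test2) >= 3] # counts of levels occurring 3 or more times
z <- names(z) # keep only the level names
Or directly as a character vector:
z <- names(which(table(neu$test2) >= 3))
Then subset with:
subset(neu, test2 %in% z)
Or:
neu[neu$test2 %in% z, ]
Use > 3 instead of >= 3 if a level must occur strictly more than three times.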
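As a self-contained sketch of the table-based approach, the following uses a strict threshold (more than three occurrences, as the question asks). The names `min_count`, `counts`, `keep` and `result` are introduced here for illustration only and are not part of the original answer.

```r
# Rebuild the example data from the question
neu <- data.frame(
  test1 = 1:14,
  test2 = c("a","b","a","b","c","c","a","c","c","d","d","f","f","f")
)

min_count <- 3                            # assumed threshold: keep levels seen MORE than this often
counts <- table(neu$test2)                # number of occurrences of each level
keep <- names(which(counts > min_count))  # level names that pass the threshold
result <- neu[neu$test2 %in% keep, ]      # rows whose level is in 'keep'
result
```

With this data only level c occurs more than three times, so `result` holds rows 5, 6, 8 and 9. Changing the comparison to `>=` would also keep levels a and f, which occur exactly three times.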
Here's another way:
with(neu, neu[ave(seq_along(test2), test2, FUN=length) > 3, ])
#   test1 test2
# 5     5     c
# 6     6     c
# 8     8     c
# 9     9     c
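The one-liner works because ave() returns, for every row, the size of the group that row belongs to, so the comparison yields a logical row mask. A small sketch that exposes this intermediate vector, where `n_in_group` and `result` are names chosen here for illustration:

```r
# Rebuild the example data from the question
neu <- data.frame(
  test1 = 1:14,
  test2 = c("a","b","a","b","c","c","a","c","c","d","d","f","f","f")
)

# ave() applies FUN within each group of test2 and returns a vector
# aligned with the rows of neu (one group size per row)
n_in_group <- ave(seq_along(neu$test2), neu$test2, FUN = length)
n_in_group
# 3 2 3 2 4 4 3 4 4 2 2 3 3 3

# Keep the rows whose group has more than three members
result <- neu[n_in_group > 3, ]
```

Storing the group sizes in their own vector makes it easy to check the counts or to try a different threshold without recomputing them.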
I'd use count from the plyr package to perform the counting:
library(plyr)
count_result = count(neu, "test2")
matching = with(count_result, test2[freq > 3])
with(neu, test1[test2 %in% matching])
[1] 5 6 8 9
The (better scaling) data.table way:
library(data.table)
dt = data.table(neu)
dt[dt[, .I[.N >= 3], by = test2]$V1] # rows of groups with 3 or more members; use .N > 3 for strictly more than three
Note: hopefully, in the future, the following simpler syntax will be the fast way of doing this:
dt[, .SD[.N >= 3], by = test2]
(cf. "Subset by group with data.table")