
Subsetting a dataframe by the amount of repetition [duplicate]

Tags:

r

If I have a dataframe like this:

neu <- data.frame(test1 = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14), 
                  test2 = c("a","b","a","b","c","c","a","c","c","d","d","f","f","f"))
neu
   test1 test2
1      1     a
2      2     b
3      3     a
4      4     b
5      5     c
6      6     c
7      7     a
8      8     c
9      9     c
10    10     d
11    11     d
12    12     f
13    13     f
14    14     f

and I would like to select only those rows whose level of the factor test2 appears more than, say, three times. What would be the fastest way?

Thanks very much; I didn't really find the right answer in the previous questions.

Miri Putzig asked May 16 '13 11:05


4 Answers

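Find the levels that occur often enough using:

z <- table(neu$test2)[table(neu$test2) >= 3] # levels repeated 3 or more times; use > 3 for strictly more than three

Or:

z <- which(table(neu$test2) >= 3) # named vector, so names(z) below still gives the levels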

Then subset with:

subset(neu, test2 %in% names(z))

Or:

neu[neu$test2 %in% names(z),]
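
Put together, the table-based approach above can be sketched as one self-contained script. The names min_n, counts, keep and result are illustrative only and do not come from the original answer; the threshold of 3 matches the >= 3 used above.

```r
# Sketch: keep rows whose test2 value occurs at least `min_n` times.
neu <- data.frame(
  test1 = 1:14,
  test2 = c("a","b","a","b","c","c","a","c","c","d","d","f","f","f")
)

min_n  <- 3                               # assumed threshold; use > instead of >= for "more than"
counts <- table(neu$test2)                # occurrences of each level
keep   <- names(counts)[counts >= min_n]  # levels that occur often enough
result <- neu[neu$test2 %in% keep, ]      # rows belonging to those levels

print(result)
```

With >= 3 this keeps the rows for a, c and f; switching the comparison to > 3 would keep only the rows for c.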

Thomas answered Oct 29 '22 21:10


Here's another way:

with(neu, neu[ave(seq_along(test2), test2, FUN = length) > 3, ])

#   test1 test2
# 5     5     c
# 6     6     c
# 8     8     c
# 9     9     c
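
To see why this works, it may help to look at the intermediate vector: ave() applies FUN within each group and returns one value per row, here the size of the group that row belongs to. The following is a small sketch of that step; n_per_row and result are illustrative names, not part of the original answer.

```r
neu <- data.frame(
  test1 = 1:14,
  test2 = c("a","b","a","b","c","c","a","c","c","d","d","f","f","f")
)

# One group size per row: every "a" row gets 3, every "c" row gets 4, and so on.
n_per_row <- ave(seq_along(neu$test2), neu$test2, FUN = length)
print(n_per_row)

# Comparing that vector with the threshold gives a logical row index.
result <- neu[n_per_row > 3, ]
print(result)
```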

Matthew Plourde answered Oct 29 '22 21:10


I'd use count from the plyr package to perform the counting:

library(plyr)
count_result = count(neu, "test2")
matching = with(count_result, test2[freq > 3])
with(neu, test1[test2 %in% matching])
# [1] 5 6 8 9

Paul Hiemstra answered Oct 29 '22 20:10


The (better scaling) data.table way:

library(data.table)
dt = data.table(neu)

dt[dt[, .I[.N >= 3], by = test2]$V1]

Note: hopefully, in the future, the following simpler syntax will be the fast way of doing this:

dt[, .SD[.N >= 3], by = test2]

(cf. the question "Subset by group with data.table")
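
The one-liner with .I can be read in two steps, sketched below. Within each group, .N is the number of rows and .I holds the row numbers in the full table, so .I[.N >= 3] yields the group's row numbers if the group is large enough and nothing otherwise. The names idx and res are illustrative, and the sketch assumes the data.table package is installed.

```r
library(data.table)

neu <- data.frame(
  test1 = 1:14,
  test2 = c("a","b","a","b","c","c","a","c","c","d","d","f","f","f")
)
dt <- as.data.table(neu)

# Step 1: per group, collect the row numbers only when the group has >= 3 rows.
# The unnamed result column is called V1 by default.
idx <- dt[, .I[.N >= 3], by = test2]$V1

# Step 2: use those row numbers as an ordinary row subset.
res <- dt[idx]
print(res)
```

Note that the rows come back grouped by test2 (in order of first appearance) rather than in their original order; sorting idx first would restore the original row order.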

eddi answered Oct 29 '22 22:10