
Remove IDs that occur fewer than x times in R

Tags: r, rows

I have a data frame df and I would like to remove people who have fewer than X rows in it. E.g., in this toy example, I would like to retain only people who have >= 5 rows.

df
   names  fruit
4   john   kiwi
7   john  apple
9   john banana
13  john orange
14  john  apple
2   mary orange
5   mary  apple
8   mary orange
10  mary  apple
12  mary  apple
1    tom  apple
3    tom banana
6    tom  apple
11   tom   kiwi

example output

df
   names  fruit
4   john   kiwi
7   john  apple
9   john banana
13  john orange
14  john  apple
2   mary orange
5   mary  apple
8   mary orange
10  mary  apple
12  mary  apple

Thanks in advance!

user2363642 asked Aug 18 '13 18:08


2 Answers

You can use table like this:

df[df$names %in% names(table(df$names))[table(df$names) >= 5],]
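
To see what the one-liner does, it can help to split it into steps. The sketch below rebuilds the toy data from the question (the row names will not match the question's shuffled ones); `min_rows`, `counts`, `keep`, and `result` are illustrative names, not part of the original answer.

```r
# Rebuild the toy data from the question (row order/names simplified)
df <- data.frame(
  names = rep(c("john", "mary", "tom"), times = c(5, 5, 4)),
  fruit = c("kiwi", "apple", "banana", "orange", "apple",
            "orange", "apple", "orange", "apple", "apple",
            "apple", "banana", "apple", "kiwi"),
  stringsAsFactors = FALSE
)

min_rows <- 5  # assumed threshold, matching the ">= 5 rows" in the question

# Step 1: count how many rows each name has
counts <- table(df$names)

# Step 2: keep the names whose count meets the threshold
keep <- names(counts)[counts >= min_rows]

# Step 3: subset the data frame to those names
result <- df[df$names %in% keep, ]
```

Storing the table in `counts` also avoids computing `table(df$names)` twice, which the one-liner does.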
Roland answered Nov 18 '22 14:11


Here's a data.table solution using the built-in .N value, which the ?data.table help file describes as follows: ‘.N’ is an integer, length 1, containing the number of rows in the group.

# create a similar reproducible example
library(data.table)
dat <- data.table(names=rep(letters[1:3],c(5,5,3)),var=1:13)

Count the rows per group in a new column cnt, then keep only the groups with at least 5 rows:

dat[, cnt:=.N, by=names][cnt >= 5]

Though I feel like there must be a way to do this without assigning a new variable. And now there is, thanks to @mnel in the comments:

dat[,if(.N>=5).SD,by=names]

This returns the group's subset of the data, .SD, for each value of the by variable when the number of rows in the group, .N, is greater than or equal to 5. For smaller groups the if returns NULL, so those groups contribute no rows. It is pretty much equivalent to the more traditional R subsetting syntax of:

dat[,.SD[.N >= 5],by=names]
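
As a quick check that the two forms agree, the sketch below runs both on the example data and compares the results. It assumes the data.table package is installed; `res_if` and `res_sd` are illustrative names.

```r
library(data.table)

# Same example data as above: groups a and b have 5 rows, c has 3
dat <- data.table(names = rep(letters[1:3], c(5, 5, 3)), var = 1:13)

# Variant 1: return .SD only for groups with at least 5 rows
res_if <- dat[, if (.N >= 5) .SD, by = names]

# Variant 2: subset .SD with a length-1 logical, recycled over the group
res_sd <- dat[, .SD[.N >= 5], by = names]

# Both should keep groups a and b and drop group c
print(res_if)
print(all.equal(res_if, res_sd))
```

Unlike the cnt approach, neither variant adds a column to dat.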
thelatemail answered Nov 18 '22 16:11