I have a data frame df and I would like to remove people who have fewer than X rows in it. E.g., in this toy example, I would like to retain people who have >= 5 rows.
df
   names  fruit
4   john   kiwi
7   john  apple
9   john banana
13  john orange
14  john  apple
2   mary orange
5   mary  apple
8   mary orange
10  mary  apple
12  mary  apple
1    tom  apple
3    tom banana
6    tom  apple
11   tom   kiwi
Example output:
df
   names  fruit
4   john   kiwi
7   john  apple
9   john banana
13  john orange
14  john  apple
2   mary orange
5   mary  apple
8   mary orange
10  mary  apple
12  mary  apple
Thanks in advance!
You can use table() like this:
df[df$names %in% names(table(df$names))[table(df$names) >= 5],]
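To make the one-liner easier to follow, here is a sketch that splits it into steps. It rebuilds the toy data from the question (row names are omitted for brevity), and the intermediate names counts, keep and result are just illustrative choices, not part of the original answer.

```r
# Rebuild the toy data from the question (row names omitted for brevity)
df <- data.frame(
  names = rep(c("john", "mary", "tom"), times = c(5, 5, 4)),
  fruit = c("kiwi", "apple", "banana", "orange", "apple",
            "orange", "apple", "orange", "apple", "apple",
            "apple", "banana", "apple", "kiwi")
)

counts <- table(df$names)              # number of rows per person
keep   <- names(counts)[counts >= 5]   # people with at least 5 rows
result <- df[df$names %in% keep, ]     # keep only those people's rows
```

Computing table() once and reusing it also avoids tabulating df$names twice, which the one-liner does.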
Here's a data.table solution using the built-in .N value, which is described in the ?data.table help file as follows: ‘.N’ is an integer, length 1, containing the number of rows in the group.
# create a similar reproducible example
library(data.table)
dat <- data.table(names=rep(letters[1:3],c(5,5,3)),var=1:13)
Remove the rows:
dat[, cnt:=.N, by=names][cnt >= 5]
Though I feel like there must be a way to do this without assigning a new variable. ...And now there is, thanks to @mnel in the comments:
dat[,if(.N>=5).SD,by=names]
This essentially returns the sub-data.table .SD for each value of the by group if the number of rows in that group, .N, is greater than or equal to 5. It is pretty much equivalent to the more traditional R subsetting syntax of:
dat[,.SD[.N >= 5],by=names]
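As a quick sanity check, the two forms can be run side by side on the example data. This is only a sketch: it assumes the data.table package is installed, and res_if and res_sd are illustrative names.

```r
library(data.table)

# Same reproducible example as above: groups a and b have 5 rows, c has 3
dat <- data.table(names = rep(letters[1:3], c(5, 5, 3)), var = 1:13)

# Return the group's rows only when the group has at least 5 of them
res_if <- dat[, if (.N >= 5) .SD, by = names]

# Subset .SD with a single logical: TRUE keeps the group, FALSE drops it
res_sd <- dat[, .SD[.N >= 5], by = names]

# Both results should contain only the rows of groups "a" and "b"
same_rows <- isTRUE(all.equal(res_if, res_sd))
```

Neither form adds a helper column to dat, unlike the cnt := .N version, which modifies dat by reference.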