Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to remove individuals with fewer than 5 observations from a data frame

To clarify the question I'll briefly describe the data.

Each row in the data.frame is an observation, and the columns represent variables pertinent to that observation including: what individual was observed, when it was observed, where it was observed, etc. I want to exclude/filter individuals for which there are fewer than 5 observations.

In other words, if there are fewer than 5 rows where individual = x, then I want to remove all rows that contain individual x and reassign the result to a new data.frame. I'm aware of some brute force techniques using something like names == unique(df$individualname) and then subsetting out those names individually and applying nrow to determine whether or not to exclude them...but there has to be a better way. Any help is appreciated, I'm still pretty new to R.

like image 873
Andrew Bade Avatar asked May 21 '16 20:05

Andrew Bade


People also ask

How do I omit rows with certain values in R?

If we prefer to work with the Tidyverse package, we can use the filter() function to remove (or select) rows based on values in a column (conditionally, that is, and the same as using subset). Furthermore, we can also use the function slice() from dplyr to remove rows based on the index.

How do you drop a specific observation?

The drop command is used to remove variables or observations from the dataset in memory. If you want to drop variables, use drop varlist. If you want to drop observations, use drop with an if or an in qualifier or both. 1.

How do I remove a specific value from a Dataframe in R?

To remove a character in an R data frame column, we can use gsub function which will replace the character with blank. For example, if we have a data frame called df that contains a character column say x which has a character ID in each value then it can be removed by using the command gsub("ID","",as.


3 Answers

An example using group_by and filter from dplyr package:

library(dplyr)
df <- data.frame(id=c(rep("a", 2), rep("b", 5), rep("c", 8)),
           foo=runif(15))

> df
id       foo
1   a 0.8717067
2   a 0.9086262
3   b 0.9962453
4   b 0.8980123
5   b 0.1535324
6   b 0.2802848
7   b 0.9366375
8   c 0.8109557
9   c 0.6945285
10  c 0.1012925
11  c 0.6822955
12  c 0.3757085
13  c 0.7348635
14  c 0.3026395
15  c 0.9707223

df %>% group_by(id) %>% filter(n()>= 5) %>% ungroup()
Source: local data frame [13 x 2]

id       foo
(fctr)     (dbl)
1       b 0.9962453
2       b 0.8980123
3       b 0.1535324
4       b 0.2802848
5       b 0.9366375
6       c 0.8109557
7       c 0.6945285
8       c 0.1012925
9       c 0.6822955
10      c 0.3757085
11      c 0.7348635
12      c 0.3026395
13      c 0.9707223

or with base R:

> df[df$id %in% names(which(table(df$id)>=5)), ]
id       foo
3   b 0.9962453
4   b 0.8980123
5   b 0.1535324
6   b 0.2802848
7   b 0.9366375
8   c 0.8109557
9   c 0.6945285
10  c 0.1012925
11  c 0.6822955
12  c 0.3757085
13  c 0.7348635
14  c 0.3026395
15  c 0.9707223

Still in base R, using with is a more elegant way to do the very same thing:

df[with(df, id %in% names(which(table(id)>=5))), ]

or:

subset(df, with(df, id %in% names(which(table(id)>=5))))
like image 193
Vincent Bonhomme Avatar answered Oct 09 '22 04:10

Vincent Bonhomme


Another way to do the same thing using the data.table package.

library(data.table)
set.seed(1)
dt <- data.table(id=sample(1:4,20,replace=TRUE),var=sample(1:100,20))

dt1<-dt[,count:=.N,by=id][(count>=5)]
dt2<-dt[,count:=.N,by=id][(count<5)]

dt1
    id var count
 1:  2  94     5
 2:  2  22     5
 3:  3  64     5
 4:  4  13     6
 5:  4  37     6
 6:  4   2     6
 7:  3  36     5
 8:  3  81     5
 9:  3  90     5
10:  2  17     5
11:  4  72     6
12:  2  57     5
13:  3  67     5
14:  4   9     6
15:  2  60     5
16:  4  34     6


dt2
   id var count
1:  1  26     4
2:  1  31     4
3:  1  44     4
4:  1  54     4
like image 24
Mike H. Avatar answered Oct 09 '22 04:10

Mike H.


It can be also with data.table using a logical condition with if after grouping by 'id'

library(data.table)
setDT(df)[, if(.N >=5) .SD, id]
#  id       foo
# 1:  b 0.9962453
# 2:  b 0.8980123
# 3:  b 0.1535324
# 4:  b 0.2802848
# 5:  b 0.9366375
# 6:  c 0.8109557
# 7:  c 0.6945285
# 8:  c 0.1012925
# 9:  c 0.6822955
#10:  c 0.3757085
#11:  c 0.7348635
#12:  c 0.3026395
#13:  c 0.9707223

data

df <- structure(list(id = c("a", "a", "b", "b", "b", "b", "b", "c", 
"c", "c", "c", "c", "c", "c", "c"), foo = c(0.8717067, 0.9086262, 
0.9962453, 0.8980123, 0.1535324, 0.2802848, 0.9366375, 0.8109557, 
0.6945285, 0.1012925, 0.6822955, 0.3757085, 0.7348635, 0.3026395, 
0.9707223)), .Names = c("id", "foo"), class = "data.frame", 
row.names = c(NA, -15L))
like image 35
akrun Avatar answered Oct 09 '22 05:10

akrun