To clarify the question I'll briefly describe the data.
Each row in the data.frame
is an observation, and the columns represent variables pertinent to that observation including: what individual was observed, when it was observed, where it was observed, etc. I want to exclude/filter individuals for which there are fewer than 5 observations.
In other words, if there are fewer than 5 rows where individual = x, then I want to remove all rows that contain individual x and reassign the result to a new data.frame
. I'm aware of some brute force techniques using something like names == unique(df$individualname)
and then subsetting out those names individually and applying nrow
to determine whether or not to exclude them...but there has to be a better way. Any help is appreciated, I'm still pretty new to R.
If we prefer to work with the Tidyverse package, we can use the filter() function to remove (or select) rows based on values in a column (conditionally, that is, and the same as using subset). Furthermore, we can also use the function slice() from dplyr to remove rows based on the index.
The drop command is used to remove variables or observations from the dataset in memory. If you want to drop variables, use drop varlist. If you want to drop observations, use drop with an if or an in qualifier or both. 1.
To remove a character in an R data frame column, we can use gsub function which will replace the character with blank. For example, if we have a data frame called df that contains a character column say x which has a character ID in each value then it can be removed by using the command gsub("ID","",as.
An example using group_by
and filter
from dplyr
package:
library(dplyr)
df <- data.frame(id=c(rep("a", 2), rep("b", 5), rep("c", 8)),
foo=runif(15))
> df
id foo
1 a 0.8717067
2 a 0.9086262
3 b 0.9962453
4 b 0.8980123
5 b 0.1535324
6 b 0.2802848
7 b 0.9366375
8 c 0.8109557
9 c 0.6945285
10 c 0.1012925
11 c 0.6822955
12 c 0.3757085
13 c 0.7348635
14 c 0.3026395
15 c 0.9707223
df %>% group_by(id) %>% filter(n()>= 5) %>% ungroup()
Source: local data frame [13 x 2]
id foo
(fctr) (dbl)
1 b 0.9962453
2 b 0.8980123
3 b 0.1535324
4 b 0.2802848
5 b 0.9366375
6 c 0.8109557
7 c 0.6945285
8 c 0.1012925
9 c 0.6822955
10 c 0.3757085
11 c 0.7348635
12 c 0.3026395
13 c 0.9707223
or with base R:
> df[df$id %in% names(which(table(df$id)>=5)), ]
id foo
3 b 0.9962453
4 b 0.8980123
5 b 0.1535324
6 b 0.2802848
7 b 0.9366375
8 c 0.8109557
9 c 0.6945285
10 c 0.1012925
11 c 0.6822955
12 c 0.3757085
13 c 0.7348635
14 c 0.3026395
15 c 0.9707223
Still in base R, using with
is a more elegant way to do the very same thing:
df[with(df, id %in% names(which(table(id)>=5))), ]
or:
subset(df, with(df, id %in% names(which(table(id)>=5))))
Another way to do the same thing using the data.table
package.
library(data.table)
set.seed(1)
dt <- data.table(id=sample(1:4,20,replace=TRUE),var=sample(1:100,20))
dt1<-dt[,count:=.N,by=id][(count>=5)]
dt2<-dt[,count:=.N,by=id][(count<5)]
dt1
id var count
1: 2 94 5
2: 2 22 5
3: 3 64 5
4: 4 13 6
5: 4 37 6
6: 4 2 6
7: 3 36 5
8: 3 81 5
9: 3 90 5
10: 2 17 5
11: 4 72 6
12: 2 57 5
13: 3 67 5
14: 4 9 6
15: 2 60 5
16: 4 34 6
dt2
id var count
1: 1 26 4
2: 1 31 4
3: 1 44 4
4: 1 54 4
It can be also with data.table
using a logical condition with if
after grouping by 'id'
library(data.table)
setDT(df)[, if(.N >=5) .SD, id]
# id foo
# 1: b 0.9962453
# 2: b 0.8980123
# 3: b 0.1535324
# 4: b 0.2802848
# 5: b 0.9366375
# 6: c 0.8109557
# 7: c 0.6945285
# 8: c 0.1012925
# 9: c 0.6822955
#10: c 0.3757085
#11: c 0.7348635
#12: c 0.3026395
#13: c 0.9707223
df <- structure(list(id = c("a", "a", "b", "b", "b", "b", "b", "c",
"c", "c", "c", "c", "c", "c", "c"), foo = c(0.8717067, 0.9086262,
0.9962453, 0.8980123, 0.1535324, 0.2802848, 0.9366375, 0.8109557,
0.6945285, 0.1012925, 0.6822955, 0.3757085, 0.7348635, 0.3026395,
0.9707223)), .Names = c("id", "foo"), class = "data.frame",
row.names = c(NA, -15L))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With