I have a dataset that unequal number of repetition. I want to subset a data by removing those entries that are incomplete (i.e. replication less than maximum). Just small example:
set.seed(123)
mydt <- data.frame (name= rep ( c("A", "B", "C", "D", "E"), c(1,2,4,4, 3)),
var1 = rnorm (14, 3,1), var2 = rnorm (14, 4,1))
mydt
name var1 var2
1 A 2.439524 3.444159
2 B 2.769823 5.786913
3 B 4.558708 4.497850
4 C 3.070508 2.033383
5 C 3.129288 4.701356
6 C 4.715065 3.527209
7 C 3.460916 2.932176
8 D 1.734939 3.782025
9 D 2.313147 2.973996
10 D 2.554338 3.271109
11 D 4.224082 3.374961
12 E 3.359814 2.313307
13 E 3.400771 4.837787
14 E 3.110683 4.153373
summary(mydt)
name var1 var2
A:1 Min. :1.735 Min. :2.033
B:2 1st Qu.:2.608 1st Qu.:3.048
C:4 Median :3.120 Median :3.486
D:4 Mean :3.203 Mean :3.688
E:3 3rd Qu.:3.446 3rd Qu.:4.412
Max. :4.715 Max. :5.787
I want to get rid of A, B, E from the data as they are incomplete. Thus expected output:
name var1 var2
4 C 3.070508 2.033383
5 C 3.129288 4.701356
6 C 4.715065 3.527209
7 C 3.460916 2.932176
8 D 1.734939 3.782025
9 D 2.313147 2.973996
10 D 2.554338 3.271109
11 D 4.224082 3.374961
Please note the dataset is big, the following may not a option:
mydt[mydt$name == "C",]
mydt[mydt$name == "D", ]
Here is a solution using data.table
:
library(data.table)
DT <- data.table(mydt, key = "name")
DT[, N := .N, by = key(DT)][N == max(N)]
# name var1 var2 N
# 1: C 3.070508 2.033383 4
# 2: C 3.129288 4.701356 4
# 3: C 4.715065 3.527209 4
# 4: C 3.460916 2.932176 4
# 5: D 1.734939 3.782025 4
# 6: D 2.313147 2.973996 4
# 7: D 2.554338 3.271109 4
# 8: D 4.224082 3.374961 4
.N
gives you the number of cases per group, and using data.table
's option to compound queries, you can immediately subset based on whatever condition you want from this new variable.
There are several approaches in base R too, the most obvious of which is table
:
with(mydt, mydt[name %in% names(which(table(name) == max(table(name)))), ])
Probably less common, but similar in approach to the data.table
suggestion, is to use ave()
:
counts <- with(mydt, as.numeric(ave(as.character(name), name, FUN = length)))
mydt[counts == max(counts), ]
Here is a simple method that doesn't require creating an additional data structure
tabl <- table(mydt[,1])
toRemove <- names(which(tabl < max(tabl)))
mydt[!mydt[,1] %in% toRemove, ]
# name var1 var2
# 4 C 3.070508 2.033383
# 5 C 3.129288 4.701356
# 6 C 4.715065 3.527209
# 7 C 3.460916 2.932176
# 8 D 1.734939 3.782025
# 9 D 2.313147 2.973996
# 10 D 2.554338 3.271109
# 11 D 4.224082 3.374961
mydt[!mydt[,1] %in% names(which(table(mydt[,1]) < max(table(mydt[,1])))), ]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With