Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

subset unbalanced (hetero replicated replication) to complete or balance dataset in r

Tags:

dataframe

r

I have a dataset that unequal number of repetition. I want to subset a data by removing those entries that are incomplete (i.e. replication less than maximum). Just small example:

set.seed(123)
mydt <- data.frame (name= rep ( c("A", "B", "C", "D", "E"), c(1,2,4,4, 3)), 
                   var1 = rnorm (14, 3,1), var2 = rnorm (14, 4,1))
 mydt
       name     var1     var2
1     A 2.439524 3.444159
2     B 2.769823 5.786913
3     B 4.558708 4.497850
4     C 3.070508 2.033383
5     C 3.129288 4.701356
6     C 4.715065 3.527209
7     C 3.460916 2.932176
8     D 1.734939 3.782025
9     D 2.313147 2.973996
10    D 2.554338 3.271109
11    D 4.224082 3.374961
12    E 3.359814 2.313307
13    E 3.400771 4.837787
14    E 3.110683 4.153373

summary(mydt)

name       var1            var2      
 A:1   Min.   :1.735   Min.   :2.033  
 B:2   1st Qu.:2.608   1st Qu.:3.048  
 C:4   Median :3.120   Median :3.486  
 D:4   Mean   :3.203   Mean   :3.688  
 E:3   3rd Qu.:3.446   3rd Qu.:4.412  
       Max.   :4.715   Max.   :5.787 

I want to get rid of A, B, E from the data as they are incomplete. Thus expected output:

name     var1     var2
4     C 3.070508 2.033383
5     C 3.129288 4.701356
6     C 4.715065 3.527209
7     C 3.460916 2.932176
8     D 1.734939 3.782025
9     D 2.313147 2.973996
10    D 2.554338 3.271109
11    D 4.224082 3.374961

Please note the dataset is big, the following may not a option:

mydt[mydt$name == "C",]
mydt[mydt$name == "D", ]
like image 204
SHRram Avatar asked Dec 19 '12 17:12

SHRram


2 Answers

Here is a solution using data.table:

library(data.table)
DT <- data.table(mydt, key = "name")
DT[, N := .N, by = key(DT)][N == max(N)]
#    name     var1     var2 N
# 1:    C 3.070508 2.033383 4
# 2:    C 3.129288 4.701356 4
# 3:    C 4.715065 3.527209 4
# 4:    C 3.460916 2.932176 4
# 5:    D 1.734939 3.782025 4
# 6:    D 2.313147 2.973996 4
# 7:    D 2.554338 3.271109 4
# 8:    D 4.224082 3.374961 4

.N gives you the number of cases per group, and using data.table's option to compound queries, you can immediately subset based on whatever condition you want from this new variable.

There are several approaches in base R too, the most obvious of which is table:

with(mydt, mydt[name %in% names(which(table(name) == max(table(name)))), ])

Probably less common, but similar in approach to the data.table suggestion, is to use ave():

counts <- with(mydt, as.numeric(ave(as.character(name), name, FUN = length)))
mydt[counts == max(counts), ]
like image 91
A5C1D2H2I1M1N2O1R2T1 Avatar answered Oct 05 '22 23:10

A5C1D2H2I1M1N2O1R2T1


Here is a simple method that doesn't require creating an additional data structure

tabl <- table(mydt[,1])
toRemove <- names(which(tabl < max(tabl)))
mydt[!mydt[,1] %in% toRemove, ]


     #    name     var1     var2
     # 4     C 3.070508 2.033383
     # 5     C 3.129288 4.701356
     # 6     C 4.715065 3.527209
     # 7     C 3.460916 2.932176
     # 8     D 1.734939 3.782025
     # 9     D 2.313147 2.973996
     # 10    D 2.554338 3.271109
     # 11    D 4.224082 3.374961


As a single line:

 mydt[!mydt[,1] %in% names(which(table(mydt[,1]) < max(table(mydt[,1])))), ]
like image 42
Ricardo Saporta Avatar answered Oct 05 '22 23:10

Ricardo Saporta