Elegant way to drop rare factor levels from data frame

Tags: r, subset

I want to subset a data frame by a factor column, keeping only the rows whose factor level occurs at least a certain number of times.

df <- data.frame(factor = factor(c(rep("a",5),rep("b",5),rep("c",2))), variable = rnorm(12))

This code creates the following data frame:

   factor    variable
1       a -1.55902013
2       a  0.22355431
3       a -1.52195456
4       a -0.32842689
5       a  0.85650212
6       b  0.00962240
7       b -0.06621508
8       b -1.41347823
9       b  0.08969098
10      b  1.31565582
11      c -1.26141417
12      c -0.33364069

I want to drop the factor levels that occur fewer than 5 times. I wrote a for-loop, and it works:

counts <- table(df$factor)
df.new <- df  # start from the full data so every rare level gets removed
for (i in seq_along(counts)) {
  if (counts[i] < 5) {
    df.new <- df.new[df.new$factor != names(counts)[i], ]
  }
}

But is there a quicker and prettier solution?


asked Jun 17 '14 08:06

BiXiC

2 Answers

require(dplyr)

df %>% group_by(factor) %>% filter(n() >= 5)
#factor   variable
#1       a  2.0769363
#2       a  0.6187513
#3       a  0.2426108
#4       a -0.4279296
#5       a  0.2270024
#6       b -0.6839748
#7       b -0.3285610
#8       b  0.2625743
#9       b -0.9532957
#10      b  1.4526317
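The same "keep groups of size 5 or more" idea can be sketched in base R without dplyr. This is only an illustration, not part of the answer above: `group_size` is a name introduced here, and the column is converted with `factor()` explicitly because `data.frame()` no longer turns strings into factors by default since R 4.0.0.

```r
# Example data as in the question, with an explicit factor column
df <- data.frame(factor = factor(c(rep("a", 5), rep("b", 5), rep("c", 2))),
                 variable = rnorm(12))

# ave() applies length() within each level and returns one value per row,
# so group_size holds, for every row, the size of that row's group
group_size <- ave(seq_along(df$factor), df$factor, FUN = length)

# Keep only the rows whose level occurs at least 5 times
df.new <- df[group_size >= 5, ]
```

Here `group_size` plays the role that `n()` plays inside the grouped `filter()` call.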


answered Nov 11 '22 21:11

talat

What about:

df.new <- df[!(df$factor %in% names(which(table(df$factor) < 5))), ]
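One thing to watch with any row-based subset: when the column is a true factor, the rare level is still listed in `levels()` after its rows are gone. A minimal sketch of removing it as well, assuming the column is a factor (converted explicitly here) and using `keep` as an illustrative name:

```r
# Example data as in the question, with an explicit factor column
df <- data.frame(factor = factor(c(rep("a", 5), rep("b", 5), rep("c", 2))),
                 variable = rnorm(12))

# Names of the levels that occur at least 5 times
keep <- names(which(table(df$factor) >= 5))

# Subset the rows, then drop the levels that no longer occur
df.new <- droplevels(df[df$factor %in% keep, ])

levels(df.new$factor)
```

Without `droplevels()`, `table(df.new$factor)` would still report level `c` with a count of 0.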

answered Nov 11 '22 21:11

Ricky
