
Elegant way to drop rare factor levels from data frame

Tags: r, subset

I want to subset a data frame by a factor column, keeping only the rows whose factor level occurs at least a certain number of times.

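# stringsAsFactors = TRUE keeps the column a factor in R >= 4.0 as well
df <- data.frame(factor = c(rep("a",5),rep("b",5),rep("c",2)), variable = rnorm(12),
                 stringsAsFactors = TRUE)

This code creates the following data frame: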

   factor    variable
1       a -1.55902013
2       a  0.22355431
3       a -1.52195456
4       a -0.32842689
5       a  0.85650212
6       b  0.00962240
7       b -0.06621508
8       b -1.41347823
9       b  0.08969098
10      b  1.31565582
11      c -1.26141417
12      c -0.33364069

I want to drop the factor levels that occur fewer than 5 times. I wrote a for loop that works:

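df.new <- df                      # start from the full data so several rare levels can be dropped
counts <- table(df$factor)        # frequency of each level, computed once
for (i in seq_along(counts)) {
  if (counts[i] < 5) {
    # remove the rows of the i-th (rare) level from the running result
    df.new <- df.new[df.new$factor != names(counts)[i], ]
  }
}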

But is there a quicker and prettier solution?

asked Jun 17 '14 at 08:06 by BiXiC




2 Answers

library(dplyr)

# keep only the rows whose factor level occurs at least 5 times
df %>% group_by(factor) %>% filter(n() >= 5)
#factor   variable
#1       a  2.0769363
#2       a  0.6187513
#3       a  0.2426108
#4       a -0.4279296
#5       a  0.2270024
#6       b -0.6839748
#7       b -0.3285610
#8       b  0.2625743
#9       b -0.9532957
#10      b  1.4526317
answered Nov 11 '22 at 21:11 by talat
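
For readers who do not have dplyr installed, the same group-size filter can be sketched in base R. This is only an illustration, not part of the answer above: it uses `ave()` to attach each row's group size, and the `group_size` name, the seed, and the `stringsAsFactors = TRUE` argument are assumptions made for the example.

```r
set.seed(42)  # arbitrary seed, only so the random column is reproducible
df <- data.frame(factor = c(rep("a", 5), rep("b", 5), rep("c", 2)),
                 variable = rnorm(12),
                 stringsAsFactors = TRUE)

# ave() returns, for every row, the size of the group that row belongs to
group_size <- ave(seq_along(df$factor), df$factor, FUN = length)

# keep the rows whose level occurs at least 5 times
df.new <- df[group_size >= 5, ]

# level "c" has no rows left, but it is still listed as a level (count 0)
table(df.new$factor)
```

Like the dplyr pipeline, this drops the rows but leaves the unused level in place.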


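What about this? (It assumes df$factor is a factor, since as.numeric() is used to get the integer level codes.)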

df.new <- df[!(as.numeric(df$factor) %in% which(table(df$factor)<5)),]
answered Nov 11 '22 at 21:11 by Ricky
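
The one-liner above matches on integer level codes. A variant that matches on level names may be easier to read, and it also works when the column is a character vector. The sketch below is a possible rewrite rather than the answerer's code; the `counts` and `keep` names and the seed are assumptions for illustration. It finishes with `droplevels()` so that the rare level disappears from `levels()` as well.

```r
set.seed(42)  # arbitrary seed, only so the random column is reproducible
df <- data.frame(factor = c(rep("a", 5), rep("b", 5), rep("c", 2)),
                 variable = rnorm(12),
                 stringsAsFactors = TRUE)

counts <- table(df$factor)           # frequency of each level
keep <- names(counts)[counts >= 5]   # names of the levels that are frequent enough

df.new <- df[df$factor %in% keep, ]  # match on level names, not integer codes
df.new <- droplevels(df.new)         # remove levels that no longer occur

levels(df.new$factor)
```

Whether to call `droplevels()` depends on what comes next: later calls to `table()` or plotting functions will otherwise still show the empty level.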