
Reducing number of factor levels before modelling

Tags: r, dplyr

I have a factor with 2600 levels and I want to reduce it to ~10 before modelling.

I thought I could do this with a rule that says "if a factor level appears fewer than x times, it should be placed into a bucket called 'other'".

Here is some example data:

df <- data.frame(colour=c("blue","blue","blue","green","green","orange","grey"))

And this is the output I am hoping for:

  colour
1   blue
2   blue
3   blue
4  green
5  green
6  other
7  other

I have tried the below:

df %>% mutate(colour = ifelse(count(colour) < 2, 'other', colour))

Error in mutate_impl(.data, dots) : Evaluation error: no applicable method for 'groups' applied to an object of class "factor".

Shinobi_Atobe asked May 24 '18 08:05



2 Answers

The forcats package, part of the tidyverse, is designed for working with factors. Its fct_lump function does what you need: with n = 2 it keeps the two most frequent levels and lumps all the others into "Other":

library(tidyverse)

df %>% mutate(colour = fct_lump(colour, n = 2))
#>   colour
#> 1   blue
#> 2   blue
#> 3   blue
#> 4  green
#> 5  green
#> 6  Other
#> 7  Other
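
Note that n = 2 selects levels by rank, not by a minimum count. If the rule really is "fewer than x times", a minimal sketch along the following lines may be closer to it. It assumes forcats 0.5.0 or later, where fct_lump_min is available; the name result and the lowercase "other" label are chosen here only to match the output in the question.

```r
library(dplyr)
library(forcats)

df <- data.frame(colour = c("blue", "blue", "blue", "green", "green", "orange", "grey"))

# Keep every level that appears at least `min` times and
# lump all remaining levels into a single "other" level.
result <- df %>%
  mutate(colour = fct_lump_min(colour, min = 2, other_level = "other"))

print(result)
```

For the original goal of going from 2600 levels to about 10, a rank-based call such as fct_lump_n(colour, n = 10) from the same forcats version may be the more direct choice, since a count threshold does not fix how many levels survive.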
Thomas K answered Sep 22 '22 14:09


With tidyverse functions, you can try something like:

library(dplyr)

df %>%
  group_by(colour) %>%
  mutate(cnt = n()) %>%   # number of rows for each colour
  ungroup() %>%           # drop the grouping once the counts are attached
  mutate(grp = if_else(cnt >= 2, as.character(colour), "Other")) %>%
  select(-cnt)

which gives the following (here the threshold is a count of at least 2):

  colour grp  
  <fct>  <chr>
1 blue   blue 
2 blue   blue 
3 blue   blue 
4 green  green
5 green  green
6 orange Other
7 grey   Other
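
The same threshold rule can also be written without dplyr. This is a minimal base R sketch, assuming the example df from the question; the names counts and rare are illustrative, and the threshold of 2 is the same one used above.

```r
df <- data.frame(colour = c("blue", "blue", "blue", "green", "green", "orange", "grey"))

# Count how often each colour occurs.
counts <- table(df$colour)

# Colours seen fewer than 2 times are treated as rare.
rare <- names(counts)[counts < 2]

# Work on character values, relabel the rare ones, then rebuild the factor
# so that the old levels are dropped.
df$colour <- as.character(df$colour)
df$colour[df$colour %in% rare] <- "other"
df$colour <- factor(df$colour)

print(df)
```

Rebuilding the factor at the end matters before modelling, because otherwise the rare levels would remain as unused levels.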
Aramis7d answered Sep 23 '22 14:09