Using dplyr to change all infrequent strings to 'other'

Tags: r, dplyr

Asked by Steve S

2 Answers

Use forcats

The forcats package, part of the tidyverse, has a function, fct_lump(), that does (I think) exactly what you want.

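library(dplyr)    # provides %>% and mutate()
library(forcats)  # provides fct_lump()

# Keep the 5 most frequent values of let; everything else becomes "Other"
letter.df %>%
    mutate(let = fct_lump(as.factor(let), n = 5))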

forcats is designed for factors, so for your example data I converted the let column from a character vector to a factor. If you want the lumped level to read "other" instead of the default "Other", use fct_lump(..., n=5, other_level='other').
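To make this concrete, here is a small self-contained sketch. The sample data is made up (the question's letter.df is not reproduced here), and it uses fct_lump_n(), which the forcats documentation recommends over fct_lump() in current releases. It also passes the character column in directly, since forcats documents character vectors as accepted input.

```r
library(dplyr)
library(forcats)

# Hypothetical stand-in for the question's data: one character column
# holding seven distinct letters with different frequencies.
letter.df <- tibble(
  let = rep(c("a", "b", "c", "d", "e", "f", "g"),
            times = c(10, 9, 8, 7, 6, 2, 1))
)

# Keep the five most frequent letters and relabel the rest as "other"
lumped <- letter.df %>%
  mutate(let = fct_lump_n(let, n = 5, other_level = "other"))

# Tabulate the result: the kept letters plus the lumped level
count(lumped, let)
```

Note that forcats only lumps when at least two levels would end up in the "other" group, which is why the sample data has two rare letters rather than one.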

Demonstration w/ `mutate_all()`

letter.df %>%
    mutate_all(as.factor) %>%
    mutate_all(~fct_lump(.x, n=5))

Since fct_lump() is an ordinary function that takes a vector, it's easy to use with mutate_all().
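In current dplyr, mutate_all() is superseded by across(). A rough equivalent of the call above might look like the following sketch; the two-column data frame df is invented for illustration, and fct_lump_n() stands in for fct_lump().

```r
library(dplyr)
library(forcats)

# Hypothetical two-column data frame; both columns are character
df <- tibble(
  x = rep(c("a", "b", "c", "d"), times = c(5, 4, 1, 1)),
  y = rep(c("p", "q", "r", "s"), times = c(1, 1, 4, 5))
)

# Lump every column down to its two most frequent values plus "Other"
lumped_all <- df %>%
  mutate(across(everything(), ~ fct_lump_n(.x, n = 2)))

# Inspect the levels that survive in each column
lapply(lumped_all, levels)
```

Replacing everything() with a selection such as where(is.character) would restrict the lumping to character columns only.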

If conversion to a factor is the bottleneck

If your data is too large and conversion to a factor is the bottleneck, I'd recommend your approach from the question, but manually specify which factor levels you want to keep. That would let you do the "truncation" and the conversion in one step.

letter.df %>%
    mutate(let = factor(let, levels=top5letters$let))

(The only complication is if your original data contains NA values that you don't want to lump in with 'other', because this last approach converts every value not listed in levels to NA.)
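If that NA caveat matters, one possible workaround is to skip the factor conversion and recode the character column directly. This is only a sketch with made-up data, and the helper vector keep is a name introduced here for illustration. Unlike a top-5 with ties, slice_head() keeps exactly five values.

```r
library(dplyr)

# Hypothetical data that includes one genuine missing value
letter.df <- tibble(
  let = c(rep(c("a", "b", "c", "d", "e", "f", "g"),
              times = c(10, 9, 8, 7, 6, 2, 1)), NA)
)

# The five most frequent non-missing values
keep <- letter.df %>%
  filter(!is.na(let)) %>%
  count(let, sort = TRUE) %>%
  slice_head(n = 5) %>%
  pull(let)

# Leave NA and the kept values alone; relabel everything else as "other"
recoded <- letter.df %>%
  mutate(let = if_else(is.na(let) | let %in% keep, let, "other"))
```

Here the column stays a character vector, the original NA remains NA, and only the infrequent letters become "other".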

103

answered Sep 28 '22 17:09

Curt F.

There is, but it may involve right_join().

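library(dplyr)

letter.df %>%
    count(let) %>%                                    # frequency of each value, stored in n
    slice_max(n, n = 5) %>%                           # five most frequent values (ties are kept)
    right_join(letter.df, by = "let") %>%             # n is NA for values outside the top five
    mutate(let = ifelse(is.na(n), "other", let)) %>%
    select(-n)                                        # drop the helper count column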
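A variant of the same join idea, sketched on invented data: starting from the original table and using left_join() keeps the rows in their original order. The names top and freq are introduced here purely for illustration.

```r
library(dplyr)

# Hypothetical data: "a" and "b" are frequent, "x" and "y" are rare
letter.df <- tibble(let = c("a", "x", "b", "a", "y", "b", "a"))

# Frequency table restricted to the two most common values
top <- letter.df %>%
  count(let, name = "freq") %>%
  slice_max(freq, n = 2)

# Rows whose value is missing from `top` get freq = NA and are relabelled
result <- letter.df %>%
  left_join(top, by = "let") %>%
  mutate(let = if_else(is.na(freq), "other", let)) %>%
  select(-freq)
```

The join only flags which rows fall outside the kept set, so the helper column can be dropped once let has been recoded.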

answered Sep 28 '22 15:09

akash87
