
Using dplyr to change all infrequent strings to 'other'

Tags: r, dplyr

I have a column of strings in my data frame and I'm trying to replace all but the n most frequently occurring strings with "other".

Though my current approach works, it seems very convoluted since it involves creating a tibble and using %in% to create a boolean vector.

So my question is: Is there an easier way to accomplish this using dplyr and, if so, how would I wrap it in a function and apply it across multiple columns using mutate_all?

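library(dplyr)

# setting up the data frame (tibble() replaces the deprecated data_frame()):
set.seed(1)  # only so that the random val column is reproducible
letter.df <- tibble(val = sample(1:25, size = 100, replace = TRUE),
                    let = rep(x = letters[1:5], length.out = 100))
letter.df[1:3, 2] <- c('x', 'y', 'z')


# my current approach more or less:
# count each string, then keep the 5 most frequent ones
top5letters <- letter.df %>%
  count(let) %>%
  arrange(desc(n)) %>%
  top_n(n = 5, wt = n)

# relabel every string that is not among the top 5
idx <- letter.df$let %in% top5letters$let
letter.df$let[!idx] <- 'other'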
Steve S, asked Jan 29 '18 19:01




2 Answers

Use forcats

The forcats package, part of the tidyverse, has a function, fct_lump(), that does (I think) exactly what you want.

library(forcats)
letter.df %>%
    mutate(let = fct_lump(as.factor(let), n = 5))

forcats is designed for factors, so for your example data I had to turn the let column into a factor rather than a character vector. If you really want the label to be "other" instead of the default "Other", use fct_lump(..., n=5, other_level='other').
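To illustrate the other_level argument, here is a minimal sketch on a made-up vector (the values in x are invented for this example, not taken from the question's data). It should keep the three most frequent values and relabel the rest:

```r
library(forcats)

# Made-up example data: "a", "b" and "c" are frequent, "x" and "y" are rare
x <- c(rep("a", 5), rep("b", 4), rep("c", 3), "x", "y")

# Keep the 3 most frequent levels; everything else becomes "other"
lumped <- fct_lump(factor(x), n = 3, other_level = "other")

levels(lumped)
table(lumped)
```

Note that fct_lump() leaves the factor unchanged when only a single level would end up in the lumped group, so very small inputs may not show any relabelling.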

Demonstration w/ mutate_all()

letter.df %>%
    mutate_all(as.factor) %>%
    mutate_all(~fct_lump(.x, n=5))

Since fct_lump() is already a function, it drops straight into mutate_all().
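In dplyr 1.0.0 and later, mutate_all() is superseded by across(). The sketch below shows one way the same idea might look with across(), wrapped in a small helper. The name lump_rare and the example tibble are hypothetical, and the helper returns character vectors rather than factors, which is a design choice rather than anything forcats requires:

```r
library(dplyr)
library(forcats)

# Hypothetical helper: keep the n most frequent values, relabel the rest
lump_rare <- function(x, n = 5, other = "other") {
  as.character(fct_lump(factor(x), n = n, other_level = other))
}

# Made-up data with two string columns and one numeric column
df <- tibble(
  id  = 1:14,
  let = c(rep("a", 5), rep("b", 4), rep("c", 3), "x", "y"),
  grp = c(rep("u", 6), rep("v", 4), rep("w", 2), "p", "q")
)

# Apply the helper to every character column; numeric columns are untouched
out <- df %>%
  mutate(across(where(is.character), ~ lump_rare(.x, n = 3)))
```

Restricting the call to where(is.character) avoids lumping numeric columns such as id, which mutate_all() would otherwise convert as well.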

If conversion to a factor is the bottleneck

If your data is too large and conversion to a factor is the bottleneck, I'd recommend your approach from the question, but manually specify which factor levels you want to keep. That lets you do the "truncation" and the conversion in one step.

letter.df %>%
    mutate(let = factor(let, levels=top5letters$let))

(The only complication is that this last approach converts every value not listed in levels to NA rather than to 'other'. If your original data already contains NA values, they become indistinguishable from the lumped ones.)
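The NA caveat can be seen on a small made-up vector. The sketch below first shows factor() merging a rare value with a genuine NA, then one possible alternative using case_when() that keeps them apart. The vector x and the keep set are assumptions for illustration; in practice keep would come from something like top5letters$let:

```r
library(dplyr)

# Made-up data: "x" is rare and the last element is a genuine missing value
x <- c("a", "a", "a", "b", "b", "x", NA)
keep <- c("a", "b")  # assumed to have been computed beforehand

# factor() alone: both "x" and the genuine NA end up as NA
f <- factor(x, levels = keep)

# case_when() keeps the genuine NA distinct from the relabelled value
relabelled <- case_when(
  is.na(x)    ~ NA_character_,
  x %in% keep ~ x,
  TRUE        ~ "other"
)
```

This avoids the factor conversion entirely, at the cost of returning a character vector.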

Curt F., answered Sep 28 '22 17:09


There is, but it may involve right_join().

letter.df %>%
  count(let) %>%
  arrange(desc(n)) %>%
  top_n(n=5) %>%
  right_join(letter.df, by = "let") %>%
  mutate(let = ifelse(is.na(n), "other", let))
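A variant of the join idea, sketched under a few assumptions: it uses left_join() from the original data so the row order is preserved, slice_max() in place of top_n() (which is superseded in recent dplyr), and drops the helper count column at the end. The example tibble is made up:

```r
library(dplyr)

# Made-up data; id is only there to show that row order is preserved
df <- tibble(
  id  = 1:14,
  let = c(rep("a", 5), rep("b", 4), rep("c", 3), "x", "y")
)

# The 3 most frequent strings, with their counts in column n
top <- df %>%
  count(let) %>%
  slice_max(n, n = 3, with_ties = FALSE)

# Rows without a match in `top` get n = NA and are relabelled
out <- df %>%
  left_join(top, by = "let") %>%
  mutate(let = if_else(is.na(n), "other", let)) %>%
  select(-n)
```

With with_ties = FALSE exactly three strings are kept even when counts tie, which may or may not be what you want.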
akash87, answered Sep 28 '22 15:09