I have a column of strings in my data frame and I'm trying to replace all but the n most frequently occurring strings with "other".
Though my current approach works, it seems very convoluted, since it involves creating a tibble and using %in% to create a boolean vector.
So my question is: is there an easier way to accomplish this using dplyr and, if so, how would I wrap it in a function and apply it across multiple columns using mutate_all?
library(dplyr)
# setting up the data frame (tibble() replaces the deprecated data_frame()):
letter.df <- tibble(val = sample(1:25, size = 100, replace = TRUE),
                    let = rep(x = letters[1:5], length.out = 100))
letter.df[1:3, 2] <- c('x','y','z')
# my current approach more or less:
top5letters <- letter.df %>%
  count(let) %>%
  arrange(desc(n)) %>%
  top_n(n = 5)
idx <- letter.df$let %in% top5letters$let
letter.df$let[!idx] <- 'other'
The forcats package, part of the tidyverse, has a function, fct_lump(), that does (I think) exactly what you want.
library(forcats)
letter.df %>%
  mutate(let = fct_lump(let %>% as.factor, n = 5))
Forcats is designed for factors, so for your example data I had to turn the let column into a factor rather than a character. And if you really wanted it to say "other" instead of "Other", you can do fct_lump(..., n=5, other_level='other').
mutate_all()
letter.df %>%
  mutate_all(as.factor) %>%
  mutate_all(~fct_lump(.x, n = 5))
Since fct_lump() is already a function, it's easy to use with mutate_all().
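To address the "wrap it in a function" part of the question, here is a minimal sketch. The helper name lump_top_n is hypothetical, and the use of across() with where(is.character) assumes dplyr 1.0 or later (it stands in for mutate_all() so that numeric columns such as val are left alone). The helper converts the result back to character, which is a choice made for this example and not something fct_lump() requires.

```r
library(dplyr)
library(forcats)

# Hypothetical helper: keep the n most frequent values of x and
# replace everything else with `other`. Returns a character vector.
lump_top_n <- function(x, n = 5, other = "other") {
  as.character(fct_lump(as.factor(x), n = n, other_level = other))
}

# Same shape as the example data in the question.
set.seed(1)
df <- tibble(
  val = sample(1:25, size = 100, replace = TRUE),
  let = rep(letters[1:5], length.out = 100)
)
df$let[1:3] <- c("x", "y", "z")

# Apply the helper to every character column; `val` is untouched.
result <- df %>%
  mutate(across(where(is.character), ~ lump_top_n(.x, n = 5)))
```

Here the three rare values x, y and z are the only ones lumped, so result$let contains a through e plus "other".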
If your data is too large and conversion to a factor is the bottleneck, I'd recommend your approach from the question, but manually specify which factor levels you want to keep. That would let you do the "truncation" and the conversion in one step.
letter.df %>%
  mutate(let = factor(let, levels = top5letters$let))
(The only complexity is if you have NA in your original data that you don't want to blur with 'other', because this last approach converts all non-provided levels to NA.)
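If missing values do need to stay distinct from 'other', one possible sketch is to branch explicitly with dplyr's case_when(). The vectors x and keep below are made-up illustration data, not taken from the question; in practice keep would be something like top5letters$let.

```r
library(dplyr)

x <- c("a", "a", "b", NA, "c", "a", "b")
keep <- c("a", "b")  # hypothetical set of values to keep

lumped <- case_when(
  is.na(x)    ~ NA_character_,  # original NAs stay NA
  x %in% keep ~ x,              # frequent values pass through
  TRUE        ~ "other"         # everything else is lumped
)
```

This stays in character form throughout, so it avoids the factor conversion at the cost of a slightly longer expression.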
There is, but it may involve right_join().
letter.df %>%
  count(let) %>%
  arrange(desc(n)) %>%
  top_n(n = 5) %>%
  right_join(letter.df, by = "let") %>%
  mutate(let = ifelse(is.na(n), "other", let))