Creating an "other" field

Tags:

r

dplyr

Right now, I have the following data.frame which was created by original.df %.% group_by(Category) %.% tally() %.% arrange(desc(n)).

DF <- structure(list(Category = c("E", "K", "M", "L", "I", "A", 
"S", "G", "N", "Q"), n = c(163051, 127133, 106680, 64868, 49701, 
47387, 47096, 45601, 40056, 36882)), .Names = c("Category", 
"n"), row.names = c(NA, 10L), class = c("tbl_df", "tbl", "data.frame"
))

         Category      n
1               E 163051
2               K 127133
3               M 106680
4               L  64868
5               I  49701
6               A  47387
7               S  47096
8               G  45601
9               N  40056
10              Q  36882

I want to create an "Other" field from the bottom-ranked categories by n, i.e.

        Category      n
1              E 163051
2              K 127133
3              M 106680
4              L  64868
5              I  49701
6          Other 217022

Right now, I am doing

rbind(filter(DF, rank(rev(n)) <= 5), 
  summarise(filter(DF, rank(rev(n)) > 5), Category = "Other", n = sum(n)))

which collapses all categories not in the top 5 into the Other category.

But I'm curious whether there's a better way in dplyr or some other existing package. By "better" I mean more succinct/readable. I'm also interested in methods with cleverer or more flexible ways to choose Other.
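One thing worth noting about the snippet above: `rank(rev(n)) <= 5` only picks the top 5 because DF is already sorted by decreasing n. A minimal base R sketch that does not depend on row order, assuming ties should share the smallest rank, could look like this (the names `shuffled`, `is_top`, `top` and `result` are made up for illustration):

```r
# Same data as in the question, deliberately shuffled to show order independence.
DF <- data.frame(
  Category = c("E", "K", "M", "L", "I", "A", "S", "G", "N", "Q"),
  n = c(163051, 127133, 106680, 64868, 49701,
        47387, 47096, 45601, 40056, 36882),
  stringsAsFactors = FALSE
)
shuffled <- DF[c(7, 2, 10, 1, 5, 3, 9, 4, 8, 6), ]

# rank(-n) gives rank 1 to the largest count, whatever the row order is.
is_top <- rank(-shuffled$n, ties.method = "min") <= 5

top <- shuffled[is_top, ]
top <- top[order(-top$n), ]          # largest first

result <- rbind(
  top,
  data.frame(Category = "Other",
             n = sum(shuffled$n[!is_top]),
             stringsAsFactors = FALSE)
)
rownames(result) <- NULL
result
```

The same `rank(-n)` condition should also work inside `filter()` in place of `rank(rev(n))`.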

260

asked May 19 '14 05:05

Hugh

3 Answers

This is another approach, assuming that each category (of the top 5 at least) only occurs once:

# df is the data frame from the question (called DF there).
# Older dplyr versions spelled the pipe %.%; current versions use %>%.
df %>% 
  arrange(desc(n)) %>%       # you could skip this step since the input df is already sorted
  mutate(Category = ifelse(1:n() > 5, "Other", Category)) %>%
  group_by(Category) %>%
  summarize(n = sum(n))

#  Category      n
#1        E 163051
#2        I  49701
#3        K 127133
#4        L  64868
#5        M 106680
#6    Other 217022

Edit:

I just noticed that my output is no longer ordered by decreasing n. After running the code again, I found that the order is kept up to and including the group_by(Category) step, but once I run summarize the order is gone (or rather, the result seems to be ordered by Category). Is that supposed to be like that?
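As far as I can tell, summarize returns one row per group, sorted by the grouping variable, which would explain the alphabetical order. A sketch of one way around it, assuming a current dplyr with `%>%`: turn Category into a factor whose levels follow the order of first appearance, so that sorting by group reproduces the original order and "Other" ends up last. The name `result` is only for illustration.

```r
library(dplyr)

DF <- data.frame(
  Category = c("E", "K", "M", "L", "I", "A", "S", "G", "N", "Q"),
  n = c(163051, 127133, 106680, 64868, 49701,
        47387, 47096, 45601, 40056, 36882),
  stringsAsFactors = FALSE
)

result <- DF %>%
  arrange(desc(n)) %>%
  mutate(
    Category = ifelse(row_number() > 5, "Other", Category),
    # Levels follow first appearance: E, K, M, L, I, Other.
    Category = factor(Category, levels = unique(Category))
  ) %>%
  group_by(Category) %>%
  summarise(n = sum(n))

result
```

Because groups are sorted by factor level rather than alphabetically, no helper index column is needed.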

Here are three more ways:

m <- 5    # number of top results to show in the final table (excl. "Other")
n <- m + 1

# Each alternative below starts from the original df.

# Preserves the order (or better: re-establishes it by index)
df <- arrange(df, desc(n)) %>%    # this could be skipped if the data is already ordered
  mutate(idx = 1:n(), Category = ifelse(idx > m, "Other", Category)) %>%
  group_by(Category) %>%
  summarize(n = sum(n), idx = first(idx)) %>%
  arrange(idx) %>%
  select(-idx)

# Doesn't preserve the order (same result as the first dplyr solution, ordered by Category)
df <- df[order(df$n, decreasing = TRUE), ]     # this could be skipped if the data is already ordered
df[n:nrow(df), 1] <- "Other"
df <- aggregate(n ~ Category, data = df, FUN = "sum")

# Preserves the order (without extra index)
df <- df[order(df$n, decreasing = TRUE), ]     # this could be skipped if the data is already ordered
df[n:nrow(df), 1] <- "Other"
df[n, 2] <- sum(df$n[df$Category == "Other"])
df <- df[1:n, ]
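The aggregate variant can also be made to keep the order. This is a sketch rather than a tested recipe, and it assumes plain base R with a data.frame copy of the question's data: aggregate orders its output by the levels of the grouping factor, so fixing the levels in order of appearance should be enough.

```r
df <- data.frame(
  Category = c("E", "K", "M", "L", "I", "A", "S", "G", "N", "Q"),
  n = c(163051, 127133, 106680, 64868, 49701,
        47387, 47096, 45601, 40056, 36882),
  stringsAsFactors = FALSE
)
m <- 5   # number of categories to keep before "Other"

df <- df[order(df$n, decreasing = TRUE), ]
df$Category[seq_len(nrow(df)) > m] <- "Other"

# Levels in order of first appearance, so "Other" is the last level.
df$Category <- factor(df$Category, levels = unique(df$Category))

result <- aggregate(n ~ Category, data = df, FUN = sum)
result
```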

151

answered Nov 07 '22 13:11

talat

Different package/different syntax version:

library(data.table)

dt = as.data.table(DF)

dt[order(-n), # your data is already sorted, so this does nothing for it
   if (.BY[[1]]) .SD else list("Other", sum(n)),
   by = 1:nrow(dt) <= 5][, !"nrow", with = F]
#   Category      n
#1:        E 163051
#2:        K 127133
#3:        M 106680
#4:        L  64868
#5:        I  49701
#6:    Other 217022
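A possibly more readable data.table variant, offered as a sketch and not as a replacement for the code above: relabel everything after the fifth row first, then sum by the new label. It relies on `.I` being the row number when there is no `by`, and on `by` returning groups in order of first appearance. `result` is an illustrative name.

```r
library(data.table)

DF <- data.frame(
  Category = c("E", "K", "M", "L", "I", "A", "S", "G", "N", "Q"),
  n = c(163051, 127133, 106680, 64868, 49701,
        47387, 47096, 45601, 40056, 36882),
  stringsAsFactors = FALSE
)

dt <- as.data.table(DF)[order(-n)]                      # sorted copy
dt[, Category := ifelse(.I <= 5, Category, "Other")]    # relabel rows 6 onwards
result <- dt[, .(n = sum(n)), by = Category]            # groups keep first-appearance order
result
```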

answered Nov 07 '22 12:11

eddi

This function modifies a column, replacing the infrequent entries with Other, either by specifying a minimum frequency, or by specifying the resultant number of categories intended.

#' @title Group infrequent entries into 'Other category'
#' @description Useful when you want to constrain the number of unique values in a column.
#' @param .data Data containing variable.
#' @param var Variable containing infrequent entries, to be collapsed into "Other". 
#' @param n Threshold for total number of categories above "Other".
#' @param count Threshold for total count of observations before "Other".
#' @param by Extra variables to group by when calculating \code{n} or \code{count}.
#' @param copy Should \code{.data} be copied? Currently only \code{TRUE} is supported.
#' @param other.category Value that infrequent entries are to be collapsed into. Defaults to \code{"Other"}.
#' @return \code{.data} but with \code{var} changed to be grouped into smaller categories.
#' @export 
mutate_other <- function(.data, var, n = 5, count, by = NULL, copy = TRUE, other.category = "Other"){
  stopifnot(is.data.table(.data), 
            is.character(other.category), 
            identical(length(other.category), 1L))

  had.key <- haskey(.data)

  if (!isTRUE(copy)){
    stop("copy must be TRUE")
  }

  out <- copy(.data)

  if (had.key){
    orig_key <- key(out)
  } else {
    orig_key <- "_order"
    out[, "_order" := 1:.N]
    setkeyv(out, "_order")
  }

  if (is.character(.data[[var]])){
    stopifnot(!("nvar" %in% names(.data)),
              var %in% names(.data))

    N <- .rank <- NULL
    n_by_var <-
      out %>%
      .[, .N, keyby = c(var, by)] %>%
      .[, .rank := rank(-N)]

    out <- merge(out, n_by_var, by = c(var, by))

    if (missing(count)){
      out[, (var) := dplyr::if_else(.rank <= n, out[[var]], other.category)]
    } else {
      out[, (var) := dplyr::if_else(N >= count, out[[var]], other.category)]
    }
    out <- 
      out %>%
      .[, N := NULL] %>%
      .[, .rank := NULL] 

    setkeyv(out, orig_key)

    if (!had.key){
      out[, (orig_key) := NULL]
      setkey(out, NULL)
    }
    out

  } else {
    warning("Attempted to use by = on a non-character vector. Aborting.")
    return(.data)
  }
}

https://github.com/HughParsonage/hutils/blob/master/R/mutate_other.R
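To show the core idea without data.table, here is a minimal base R sketch that works on a plain character vector of raw (untallied) entries. `collapse_other` is a hypothetical helper written for this illustration, not part of hutils, and it only covers the "keep the n most frequent values" case.

```r
# Hypothetical helper: keep the n most frequent values of x, replace the rest.
collapse_other <- function(x, n = 5, other = "Other") {
  stopifnot(is.character(x), is.character(other), length(other) == 1L)
  counts <- table(x)
  freq <- as.vector(counts)
  # Rank 1 = most frequent; tied values share the smallest rank.
  keep <- names(counts)[rank(-freq, ties.method = "min") <= n]
  ifelse(x %in% keep, x, other)
}

x <- rep(c("a", "b", "c", "d"), times = c(5, 3, 1, 1))
collapsed <- collapse_other(x, n = 2)
table(collapsed)
```

With `ties.method = "min"`, values tied at the cut-off are all kept, so the result can contain more than n distinct categories; a different tie rule would change that.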

answered Nov 07 '22 12:11

Hugh
