Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating an "other" field

Tags:

r

dplyr

Right now, I have the following data.frame which was created by original.df %.% group_by(Category) %.% tally() %.% arrange(desc(n)).

DF <- structure(list(Category = c("E", "K", "M", "L", "I", "A", 
"S", "G", "N", "Q"), n = c(163051, 127133, 106680, 64868, 49701, 
47387, 47096, 45601, 40056, 36882)), .Names = c("Category", 
"n"), row.names = c(NA, 10L), class = c("tbl_df", "tbl", "data.frame"
))

         Category      n
1               E 163051
2               K 127133
3               M 106680
4               L  64868
5               I  49701
6               A  47387
7               S  47096
8               G  45601
9               N  40056
10              Q  36882

I want to create an "Other" field from the bottom ranked Categories by n. i.e.

        Category      n
1              E 163051
2              K 127133
3              M 106680
4              L  64868
5              I  49701
6          Other 217022

Right now, I am doing

rbind(filter(DF, rank(rev(n)) <= 5), 
  summarise(filter(DF, rank(rev(n)) > 5), Category = "Other", n = sum(n)))

which collapses all categories not in the top 5 into the Other category.

But I'm curious whether there's a better way in dplyr or some other existing package. By "better" I mean more succinct/readable. I'm also interested in methods with cleverer or more flexible ways to choose Other.

like image 260
Hugh Avatar asked May 19 '14 05:05

Hugh


People also ask

How do I create a new custom field?

Select the “Add column” button to see the “New field” option. This button will open the menu to create a new custom field. From here, select the type of custom field you would like to create. Give your field a name select ‘Create’ to create your new custom field.

How does the other field work in a form?

The form will force an answer to the question. The form will accept/record the " Other " field, only if there is input in it. The form will not accept (let alone record) the answer when the " Other " field is selected but left blank .

How do I create a custom field in a PDF?

Create Fields in PDF Forms You should click the "Add Text Field" in "Form" tab; toggle the cursor over the page in the PDF document so as to add an interactive field. Right-click the field to choose properties, then you can set its colors, style, appearance and actions etc.

How to create fields in Microsoft Word 2013?

How to Create Fields in Word. This information is only applicable to Word 2013 and above. Adding a Legacy Text Field in Microsoft Word: Step one: Under the "Developer" tab, go to the Word "Ribbon", click the icon for "Legacy Tools". Then, you will see a drop-down menu.


3 Answers

This is another approach, assuming that each category (of the top 5 at least) only occurs once:

df %.% 
  arrange(desc(n)) %.%       #you could skip this step since you arranged the input df already according to your question
  mutate(Category = ifelse(1:n() > 5, "Other", Category)) %.%
  group_by(Category) %.%
  summarize(n = sum(n))

#  Category      n
#1        E 163051
#2        I  49701
#3        K 127133
#4        L  64868
#5        M 106680
#6    Other 217022

Edit:

I just noticed that my output is not order by decreasing n any more. After running the code again, I found out that the order is kept until after the group_by(Category) but when I run the summarize afterwards, the order is gone (or rather, it seems to be ordered by Category). Is that supposed to be like that?

Here are three more ways:

m <- 5    #number of top results to show in final table (excl. "Other")
n <- m+1

#preserves the order (or better: reesatblishes it by index)
df <- arrange(df, desc(n)) %.%    #this could be skipped if data already ordered 
  mutate(idx = 1:n(), Category = ifelse(idx > m, "Other", Category)) %.%
  group_by(Category) %.%
  summarize(n = sum(n), idx = first(idx)) %.%
  arrange(idx) %.%
  select(-idx)

#doesnt preserve the order (same result as in first dplyr solution, ordered by Category)
df[order(df$n, decreasing=T),]     #this could be skipped if data already ordered 
df[n:nrow(df),1] <- "Other"
df <- aggregate(n ~ Category, data = df, FUN = "sum")

#preserves the order (without extra index)
df[order(df$n, decreasing=T),]     #this could be skipped if data already ordered 
df[n:nrow(df),1] <- "Other"
df[n,2] <- sum(df$n[df$Category == "Other"]) 
df <- df[1:n,]
like image 151
talat Avatar answered Nov 07 '22 13:11

talat


Different package/different syntax version:

library(data.table)

dt = as.data.table(DF)

dt[order(-n), # your data is already sorted, so this does nothing for it
   if (.BY[[1]]) .SD else list("Other", sum(n)),
   by = 1:nrow(dt) <= 5][, !"nrow", with = F]
#   Category      n
#1:        E 163051
#2:        K 127133
#3:        M 106680
#4:        L  64868
#5:        I  49701
#6:    Other 217022
like image 30
eddi Avatar answered Nov 07 '22 12:11

eddi


This function modifies a column, replacing the infrequent entries with Other, either by specifying a minimum frequency, or by specifying the resultant number of categories intended.

#' @title Group infrequent entries into 'Other category'
#' @description Useful when you want to constrain the number of unique values in a column.
#' @param .data Data containing variable.
#' @param var Variable containing infrequent entries, to be collapsed into "Other". 
#' @param n Threshold for total number of categories above "Other".
#' @param count Threshold for total count of observations before "Other".
#' @param by Extra variables to group by when calculating \code{n} or \code{count}.
#' @param copy Should \code{.data} be copied? Currently only \code{TRUE} is supported.
#' @param other.category Value that infrequent entries are to be collapsed into. Defaults to \code{"Other"}.
#' @return \code{.data} but with \code{var} changed to be grouped into smaller categories.
#' @export 
mutate_other <- function(.data, var, n = 5, count, by = NULL, copy = TRUE, other.category = "Other"){
  stopifnot(is.data.table(.data), 
            is.character(other.category), 
            identical(length(other.category), 1L))

  had.key <- haskey(.data)

  if (!isTRUE(copy)){
    stop("copy must be TRUE")
  }

  out <- copy(.data)

  if (had.key){
    orig_key <- key(out)
  } else {
    orig_key <- "_order"
    out[, "_order" := 1:.N]
    setkeyv(out, "_order")
  }

  if (is.character(.data[[var]])){
    stopifnot(!("nvar" %in% names(.data)),
              var %in% names(.data))

    N <- .rank <- NULL
    n_by_var <-
      out %>%
      .[, .N, keyby = c(var, by)] %>%
      .[, .rank := rank(-N)]

    out <- merge(out, n_by_var, by = c(var, by))

    if (missing(count)){
      out[, (var) := dplyr::if_else(.rank <= n, out[[var]], other.category)]
    } else {
      out[, (var) := dplyr::if_else(N >= count, out[[var]], other.category)]
    }
    out <- 
      out %>%
      .[, N := NULL] %>%
      .[, .rank := NULL] 

    setkeyv(out, orig_key)

    if (!had.key){
      out[, (orig_key) := NULL]
      setkey(out, NULL)
    }
    out

  } else {
    warning("Attempted to use by = on a non-character vector. Aborting.")
    return(.data)
  }
}

https://github.com/HughParsonage/hutils/blob/master/R/mutate_other.R

like image 32
Hugh Avatar answered Nov 07 '22 12:11

Hugh