
Categorize numeric variable with mutate

I would like to categorize a numeric variable in my data.frame object using dplyr (and have no idea how to do it).

Without dplyr, I would probably do something like:

    df <- data.frame(a = rnorm(1e3), b = rnorm(1e3))
    df$a <- cut(df$a, breaks = quantile(df$a, probs = seq(0, 1, 0.2)))

and that would be it. However, I would strongly prefer to do this with a dplyr function (mutate, I suppose) as part of the chain of other operations I perform on my data.frame.

Marta Karas asked Apr 18 '14 22:04




2 Answers

The ggplot2 package has 3 functions that work well for these tasks:

  • cut_number(): Makes n groups with (approximately) equal numbers of observations
  • cut_interval(): Makes n groups with equal range
  • cut_width(): Makes groups whose width is given by the width argument

My go-to is cut_number() because this uses evenly spaced quantiles for binning observations. Here's an example with skewed data.

    library(tidyverse)

    skewed_tbl <- tibble(
        counts = c(1:100, 1:50, 1:20, rep(1:10, 3),
                   rep(1:5, 5), rep(1:2, 10), rep(1, 20))
        ) %>%
        mutate(
            counts_cut_number   = cut_number(counts, n = 4),
            counts_cut_interval = cut_interval(counts, n = 4),
            counts_cut_width    = cut_width(counts, width = 25)
            )

    # Data
    skewed_tbl
    #> # A tibble: 265 x 4
    #>    counts counts_cut_number counts_cut_interval counts_cut_width
    #>     <dbl> <fct>             <fct>               <fct>
    #>  1      1 [1,3]             [1,25.8]            [-12.5,12.5]
    #>  2      2 [1,3]             [1,25.8]            [-12.5,12.5]
    #>  3      3 [1,3]             [1,25.8]            [-12.5,12.5]
    #>  4      4 (3,13]            [1,25.8]            [-12.5,12.5]
    #>  5      5 (3,13]            [1,25.8]            [-12.5,12.5]
    #>  6      6 (3,13]            [1,25.8]            [-12.5,12.5]
    #>  7      7 (3,13]            [1,25.8]            [-12.5,12.5]
    #>  8      8 (3,13]            [1,25.8]            [-12.5,12.5]
    #>  9      9 (3,13]            [1,25.8]            [-12.5,12.5]
    #> 10     10 (3,13]            [1,25.8]            [-12.5,12.5]
    #> # ... with 255 more rows

    summary(skewed_tbl$counts)
    #>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    #>    1.00    3.00   13.00   25.75   42.00  100.00

    # Histogram showing skew
    skewed_tbl %>%
        ggplot(aes(counts)) +
        geom_histogram(bins = 30)

    # cut_number() evenly distributes observations into bins by quantile
    skewed_tbl %>%
        ggplot(aes(counts_cut_number)) +
        geom_bar()

    # cut_interval() evenly splits the interval across the range
    skewed_tbl %>%
        ggplot(aes(counts_cut_interval)) +
        geom_bar()

    # cut_width() uses the width = 25 to create bins that are 25 in width
    skewed_tbl %>%
        ggplot(aes(counts_cut_width)) +
        geom_bar()

Created on 2018-11-01 by the reprex package (v0.2.1)
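
For readers without ggplot2 loaded, the quantile idea behind cut_number() can be approximated in base R. This is only a sketch, assuming the binning amounts to cutting at evenly spaced quantiles as described above; the names counts, brks and counts_bin are made up for illustration, and it is not meant as a description of how ggplot2 implements the function.

```r
# Same skewed vector as in the example above
counts <- c(1:100, 1:50, 1:20, rep(1:10, 3),
            rep(1:5, 5), rep(1:2, 10), rep(1, 20))

# Five evenly spaced quantiles delimit four bins
brks <- quantile(counts, probs = seq(0, 1, length.out = 5))

# include.lowest = TRUE keeps the minimum inside the first bin
counts_bin <- cut(counts, breaks = brks, include.lowest = TRUE)

# Number of observations per bin
table(counts_bin)
```

Because this vector contains many tied values, the bins are only approximately equal in size; a tie that sits on a break all lands in the same bin.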

Matt Dancho answered Oct 06 '22 23:10



    library(dplyr)

    set.seed(123)
    df <- data.frame(a = rnorm(10), b = rnorm(10))

    df %>% mutate(a = cut(a, breaks = quantile(a, probs = seq(0, 1, 0.2))))

giving:

                     a          b
    1  (-0.586,-0.316]  1.2240818
    2   (-0.316,0.094]  0.3598138
    3      (0.68,1.72]  0.4007715
    4   (-0.316,0.094]  0.1106827
    5     (0.094,0.68] -0.5558411
    6      (0.68,1.72]  1.7869131
    7     (0.094,0.68]  0.4978505
    8             <NA> -1.9666172
    9   (-1.27,-0.586]  0.7013559
    10 (-0.586,-0.316] -0.4727914
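
Row 8 comes out as <NA> because cut() builds right-closed intervals by default, so the smallest value of a, which equals the lowest break, falls outside every bin. If that is not wanted, one possible variation is to pass include.lowest = TRUE. The sketch below is not part of the original answer, and the names binned and a_bin are made up for illustration.

```r
library(dplyr)

set.seed(123)
df <- data.frame(a = rnorm(10), b = rnorm(10))

binned <- df %>%
  mutate(
    a_bin = cut(
      a,
      breaks = quantile(a, probs = seq(0, 1, 0.2)),
      include.lowest = TRUE  # close the first interval on the left
    )
  )

# Rows per bin; the minimum of a is now assigned to the first bin
binned %>% count(a_bin)
```

Writing the result to a new column (a_bin) instead of overwriting a keeps the original numeric values available for later steps in the chain.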
G. Grothendieck answered Oct 07 '22 00:10
