I would like to categorize a numeric variable in my data.frame object using dplyr (and have no idea how to do it). Without dplyr, I would probably do something like:
df <- data.frame(a = rnorm(1e3), b = rnorm(1e3))
df$a <- cut(df$a, breaks = quantile(df$a, probs = seq(0, 1, 0.2)))
and it would be done. However, I strongly prefer to do it with a dplyr function (mutate, I suppose) as part of the chain of other operations I perform on my data.frame.
Variables may be classified into two main categories: categorical and numeric. Each category is then divided into two subcategories: nominal or ordinal for categorical variables, discrete or continuous for numeric variables.
You can use the cut() function in R to create a categorical variable from a continuous one. Note that breaks specifies the values to split the continuous variable on and labels specifies the label to give to the values of the new categorical variable.
First, we will convert numerical data to categorical data using the cut() function.
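As a minimal sketch of how breaks and labels work together in cut(), the example below bins a small vector of ages. The data, the bin edges, and the label names are invented for illustration and do not come from the question.

```r
# Hypothetical ages, invented for this illustration
ages <- c(3, 17, 18, 42, 65, 80)

age_group <- cut(
  ages,
  breaks = c(0, 17, 64, Inf),             # bin edges: (0,17], (17,64], (64,Inf]
  labels = c("minor", "adult", "senior")  # one label per bin
)

age_group         # a factor with three levels
table(age_group)  # number of observations per bin
```

By default the intervals are closed on the right, so the value 17 falls in the first bin rather than the second. Note that cut() needs one fewer label than break points.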
The ggplot2 package has 3 functions that work well for these tasks:
cut_number(): Makes n groups with (approximately) equal numbers of observations
cut_interval(): Makes n groups with equal range
cut_width(): Makes groups of width width
My go-to is cut_number() because this uses evenly spaced quantiles for binning observations. Here's an example with skewed data.
library(tidyverse)

skewed_tbl <- tibble(
  counts = c(1:100, 1:50, 1:20, rep(1:10, 3), rep(1:5, 5), rep(1:2, 10), rep(1, 20))
) %>%
  mutate(
    counts_cut_number   = cut_number(counts, n = 4),
    counts_cut_interval = cut_interval(counts, n = 4),
    counts_cut_width    = cut_width(counts, width = 25)
  )

# Data
skewed_tbl
#> # A tibble: 265 x 4
#>    counts counts_cut_number counts_cut_interval counts_cut_width
#>     <dbl> <fct>             <fct>               <fct>
#>  1      1 [1,3]             [1,25.8]            [-12.5,12.5]
#>  2      2 [1,3]             [1,25.8]            [-12.5,12.5]
#>  3      3 [1,3]             [1,25.8]            [-12.5,12.5]
#>  4      4 (3,13]            [1,25.8]            [-12.5,12.5]
#>  5      5 (3,13]            [1,25.8]            [-12.5,12.5]
#>  6      6 (3,13]            [1,25.8]            [-12.5,12.5]
#>  7      7 (3,13]            [1,25.8]            [-12.5,12.5]
#>  8      8 (3,13]            [1,25.8]            [-12.5,12.5]
#>  9      9 (3,13]            [1,25.8]            [-12.5,12.5]
#> 10     10 (3,13]            [1,25.8]            [-12.5,12.5]
#> # ... with 255 more rows

summary(skewed_tbl$counts)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>    1.00    3.00   13.00   25.75   42.00  100.00

# Histogram showing skew
skewed_tbl %>%
  ggplot(aes(counts)) +
  geom_histogram(bins = 30)
# cut_number() evenly distributes observations into bins by quantile
skewed_tbl %>%
  ggplot(aes(counts_cut_number)) +
  geom_bar()

# cut_interval() evenly splits the interval across the range
skewed_tbl %>%
  ggplot(aes(counts_cut_interval)) +
  geom_bar()

# cut_width() uses the width = 25 to create bins that are 25 in width
skewed_tbl %>%
  ggplot(aes(counts_cut_width)) +
  geom_bar()
Created on 2018-11-01 by the reprex package (v0.2.1)
library(dplyr)

set.seed(123)
df <- data.frame(a = rnorm(10), b = rnorm(10))

df %>%
  mutate(a = cut(a, breaks = quantile(a, probs = seq(0, 1, 0.2))))
giving:
                 a          b
1  (-0.586,-0.316]  1.2240818
2   (-0.316,0.094]  0.3598138
3      (0.68,1.72]  0.4007715
4   (-0.316,0.094]  0.1106827
5     (0.094,0.68] -0.5558411
6      (0.68,1.72]  1.7869131
7     (0.094,0.68]  0.4978505
8             <NA> -1.9666172
9   (-1.27,-0.586]  0.7013559
10 (-0.586,-0.316] -0.4727914
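The <NA> in row 8 of the output above appears because cut() builds intervals that are open on the left by default, so the smallest value of a, which equals the lowest break, falls outside the first bin. One possible way around this, sketched here with the same seed and data, is to pass include.lowest = TRUE. The column name a_bin is only an illustrative choice that keeps the original a intact.

```r
library(dplyr)

set.seed(123)
df <- data.frame(a = rnorm(10), b = rnorm(10))

binned <- df %>%
  mutate(
    a_bin = cut(
      a,
      breaks = quantile(a, probs = seq(0, 1, 0.2)),
      include.lowest = TRUE  # close the first interval on the left so min(a) is binned
    )
  )

# Every row now has a bin; with 10 rows, each of the 5 quintile bins holds 2
count(binned, a_bin)
```

If only the bin number is needed rather than the interval label, dplyr::ntile(a, 5) is an alternative worth considering, although it assigns ranks to groups instead of cutting at quantile values.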