I would like to categorize a numeric variable in my <code>data.frame</code> object using <code>dplyr</code> (and have no idea how to do it). Without <code>dplyr</code>, I would probably do something like:
<pre class="prettyprint"><code>df <- data.frame(a = rnorm(1e3), b = rnorm(1e3))
df$a <- cut(df$a, breaks = quantile(df$a, probs = seq(0, 1, 0.2)))
</code></pre>
and it would be done. However, I strongly prefer to do it with some <code>dplyr</code> function (<code>mutate</code>, I suppose) in the <code>chain</code> sequence of other actions I perform on my <code>data.frame</code>.

The <code>ggplot2</code> package has 3 functions that work well for these tasks:
<ul>
<li> <code>cut_number()</code>: Makes n groups with (approximately) equal numbers of observations</li>
<li> <code>cut_interval()</code>: Makes n groups with equal range</li>
<li> <code>cut_width()</code>: Makes groups of width <code>width</code></li>
</ul>
My go-to is <code>cut_number()</code> because it uses evenly spaced quantiles for binning observations. Here's an example with skewed data.
<pre class="prettyprint lang-r prettyprint-override"><code>library(tidyverse)

skewed_tbl <- tibble(
    counts = c(1:100, 1:50, 1:20, rep(1:10, 3),
               rep(1:5, 5), rep(1:2, 10), rep(1, 20))
    ) %>%
    mutate(
        counts_cut_number   = cut_number(counts, n = 4),
        counts_cut_interval = cut_interval(counts, n = 4),
        counts_cut_width    = cut_width(counts, width = 25)
        )

# Data
skewed_tbl
#> # A tibble: 265 x 4
#>    counts counts_cut_number counts_cut_interval counts_cut_width
#>     <dbl> <fct>             <fct>               <fct>
#>  1      1 [1,3]             [1,25.8]            [-12.5,12.5]
#>  2      2 [1,3]             [1,25.8]            [-12.5,12.5]
#>  3      3 [1,3]             [1,25.8]            [-12.5,12.5]
#>  4      4 (3,13]            [1,25.8]            [-12.5,12.5]
#>  5      5 (3,13]            [1,25.8]            [-12.5,12.5]
#>  6      6 (3,13]            [1,25.8]            [-12.5,12.5]
#>  7      7 (3,13]            [1,25.8]            [-12.5,12.5]
#>  8      8 (3,13]            [1,25.8]            [-12.5,12.5]
#>  9      9 (3,13]            [1,25.8]            [-12.5,12.5]
#> 10     10 (3,13]            [1,25.8]            [-12.5,12.5]
#> # ... with 255 more rows

summary(skewed_tbl$counts)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>    1.00    3.00   13.00   25.75   42.00  100.00

# Histogram showing skew
skewed_tbl %>%
    ggplot(aes(counts)) +
    geom_histogram(bins = 30)
</code></pre>
<img src="https://i.imgur.com/ZFfcQHo.png" alt="">
<pre class="prettyprint lang-r prettyprint-override"><code># cut_number() evenly distributes observations into bins by quantile
skewed_tbl %>%
    ggplot(aes(counts_cut_number)) +
    geom_bar()
</code></pre>
<img src="https://i.imgur.com/SZkEQKr.png" alt="">
<pre class="prettyprint lang-r prettyprint-override"><code># cut_interval() evenly splits the interval across the range
skewed_tbl %>%
    ggplot(aes(counts_cut_interval)) +
    geom_bar()
</code></pre>
<img src="https://i.imgur.com/4undogh.png" alt="">
<pre class="prettyprint lang-r prettyprint-override"><code># cut_width() uses width = 25 to create bins that are 25 wide
skewed_tbl %>%
    ggplot(aes(counts_cut_width)) +
    geom_bar()
</code></pre>
<img src="https://i.imgur.com/cVf8Ciq.png" alt="">
Created on 2018-11-01 by the reprex package (v0.2.1)
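To tie this back to the question, here is a minimal sketch of calling <code>cut_number()</code> inside a <code>mutate()</code> chain on a data frame shaped like the asker's. The seed, the column name <code>a_bin</code> and the choice of <code>n = 5</code> (to mirror the 20% quantile steps in the question) are assumptions made for illustration, not part of the original answer.

```r
library(dplyr)
library(ggplot2)  # provides cut_number()

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(a = rnorm(1e3), b = rnorm(1e3))

# Bin `a` into 5 quantile groups as one step of a pipe.
# a_bin is a hypothetical column name; overwrite `a` instead if preferred.
binned <- df %>%
  mutate(a_bin = cut_number(a, n = 5))

# Count rows per bin to check that the groups are balanced
bin_counts <- binned %>% count(a_bin)
print(bin_counts)
```

Because <code>cut_number()</code> includes the lowest value in the first interval, this variant should not leave the minimum unassigned.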

<pre class="prettyprint"><code>library(dplyr)

set.seed(123)
df <- data.frame(a = rnorm(10), b = rnorm(10))

df %>% mutate(a = cut(a, breaks = quantile(a, probs = seq(0, 1, 0.2))))
</code></pre>
giving:
<pre class="prettyprint"><code>                 a          b
1  (-0.586,-0.316]  1.2240818
2   (-0.316,0.094]  0.3598138
3      (0.68,1.72]  0.4007715
4   (-0.316,0.094]  0.1106827
5     (0.094,0.68] -0.5558411
6      (0.68,1.72]  1.7869131
7     (0.094,0.68]  0.4978505
8             <NA> -1.9666172
9   (-1.27,-0.586]  0.7013559
10 (-0.586,-0.316] -0.4727914
</code></pre>
Row 8 is <code>NA</code> because it holds the minimum of <code>a</code>, which sits on the open left edge of the lowest interval; passing <code>include.lowest = TRUE</code> to <code>cut()</code> closes that edge.
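As a hedged variation on the call above, the sketch below adds <code>include.lowest = TRUE</code> so the smallest value is binned rather than dropped, and shows <code>dplyr::ntile()</code> as an alternative that returns integer bin ids instead of interval labels. The column names <code>a_cut</code> and <code>a_ntile</code> are hypothetical and chosen only for this example.

```r
library(dplyr)

set.seed(123)
df <- data.frame(a = rnorm(10), b = rnorm(10))

res <- df %>%
  mutate(
    # include.lowest = TRUE closes the first interval on the left,
    # so the minimum of `a` is binned instead of becoming NA
    a_cut = cut(a,
                breaks = quantile(a, probs = seq(0, 1, 0.2)),
                include.lowest = TRUE),
    # ntile() assigns an integer bin id from 1 to 5 by rank
    a_ntile = ntile(a, 5)
  )

print(res)
```

Interval labels are easier to read in plots and tables, while the integer ids from <code>ntile()</code> can be more convenient for grouping or joining.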

Categorize numeric variable with mutate

Tags: r, dplyr, categorization


asked Apr 18 '14 22:04

Marta Karas

2 Answers


answered Oct 06 '22 23:10

Matt Dancho


answered Oct 07 '22 00:10

G. Grothendieck
