I was looking for a clear explanation of the statement 'labels are constructed using "(a,b]" interval notation' in the cut help file, which does not spell out what the notation means.

Using the R cut function - how do the breaks and labels options work

1 Answer

So I tested cut on some simple examples as follows:

df <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 99))
df$cut <- cut(df$x, breaks = c(2, 4, 6, 8), right = TRUE)
df
#  x   cut
#  1  <NA>
#  2  <NA>
#  3 (2,4]
#  4 (2,4]
#  5 (4,6]
#  6 (4,6]
#  7 (6,8]
# 99  <NA>

So '(' means x > the break on the left, and ']' means x <= the (next) break on the right. A value at or below the lowest break is flagged as NA, and a value that exceeds the highest break is also flagged as NA.
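To make the open and closed ends concrete, here is a small sketch that feeds cut values sitting exactly on the breaks. The vector x and the variable name bins are illustrative choices, not part of the original example.

```r
# Values placed exactly on the breaks show which end of each bin is closed.
x <- c(2, 2.5, 4, 8, 8.5)
bins <- cut(x, breaks = c(2, 4, 6, 8), right = TRUE)

# as.character() converts the factor to its labels; NA marks values outside (2, 8].
# 2 is excluded (open left end), 4 and 8 are included (closed right ends).
data.frame(x = x, bin = as.character(bins))
```

Here 2 and 8.5 come back as NA, while 4 lands in (2,4] and 8 lands in (6,8].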

Next, testing the option include.lowest = TRUE:

df$cut <- cut(df[ ,1], breaks = c(2,4,6,8), right = TRUE, include.lowest = TRUE)
df
#  x   cut
#  1  <NA>
#  2 [2,4]
#  3 [2,4]
#  4 [2,4]
#  5 (4,6]
#  6 (4,6]
#  7 (6,8]
# 99  <NA>

So here, for the first bin (between the first two breaks), the '[' on the left means x >= the first break and the ']' means x <= the second break. Subsequent bins are treated as above, and values outside the range of the breaks (1 and 99) are still NA.
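One way to see exactly what include.lowest changes is to compare the factor levels of the two calls. This is a minimal sketch; the variable names default_bins and closed_first are made up for illustration.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 99)
breaks <- c(2, 4, 6, 8)

default_bins <- cut(x, breaks = breaks)
closed_first <- cut(x, breaks = breaks, include.lowest = TRUE)

# Only the label of the first bin differs: "(2,4]" becomes "[2,4]".
levels(default_bins)
levels(closed_first)

# The value 2 is NA in the first call and falls in "[2,4]" in the second.
as.character(default_bins[x == 2])
as.character(closed_first[x == 2])
```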

Next the NA values can be addressed by using -Inf and/or +Inf in the breaks as follows:

df$cut <- cut(df[ ,1], breaks = c(-Inf,2,4,6,8,+Inf), right = TRUE, include.lowest = TRUE)
df
#  x      cut
#  1 [-Inf,2]
#  2 [-Inf,2]
#  3    (2,4]
#  4    (2,4]
#  5    (4,6]
#  6    (4,6]
#  7    (6,8]
# 99 (8, Inf]
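As a quick check that open-ended breaks really remove the NA values, the sketch below counts missing bins with and without -Inf and Inf. The names bounded and open_ended are illustrative. Note that include.lowest is left at its default here: with -Inf as the lowest break it would only matter for a value that is itself -Inf.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 99)

bounded    <- cut(x, breaks = c(2, 4, 6, 8))
open_ended <- cut(x, breaks = c(-Inf, 2, 4, 6, 8, Inf))

# 1, 2 and 99 fall outside (2, 8], so three values are NA.
sum(is.na(bounded))

# With open-ended breaks every finite value lands in some bin.
anyNA(open_ended)

# Counts per bin, in break order.
table(open_ended)
```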

Setting right = FALSE flips which end of each interval is closed: bins become [a,b), so a value equal to a break falls into the bin that starts at that break, as in the example below:

df$cut <- cut(df[ ,1], breaks = c(-Inf,2,4,6,8,+Inf), right = FALSE)
df
#  x      cut
#  1 [-Inf,2)
#  2    [2,4)
#  3    [2,4)
#  4    [4,6)
#  5    [4,6)
#  6    [6,8)
#  7    [6,8)
# 99 [8, Inf)
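With right = FALSE, the cut help page notes that include.lowest applies to the highest break instead of the lowest. The sketch below illustrates this with finite breaks; the vector x and the names half_open and closed_last are illustrative.

```r
x <- c(2, 4, 8)
breaks <- c(2, 4, 6, 8)

# Left-closed bins: 8 equals the highest break and is excluded, so it is NA.
half_open <- as.character(cut(x, breaks = breaks, right = FALSE))
half_open

# include.lowest = TRUE now closes the right end of the last bin, giving "[6,8]".
closed_last <- as.character(cut(x, breaks = breaks, right = FALSE,
                                include.lowest = TRUE))
closed_last
```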

Finally, the labels option lets you give the bins custom names should you so wish (one label per bin, so one fewer than the number of breaks):

lbls <- c('x<=2','2<x<=4','4<x<=6','6<x<=8','x>8')
df$cut <- cut(df[ ,1], breaks = c(-Inf,2,4,6,8,+Inf), right = TRUE, include.lowest = TRUE, labels = lbls)
df
#  x    cut
#  1   x<=2
#  2   x<=2
#  3 2<x<=4
#  4 2<x<=4
#  5 4<x<=6
#  6 4<x<=6
#  7 6<x<=8
# 99    x>8
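The labels argument also accepts FALSE, in which case cut returns integer bin codes instead of a factor. The sketch below shows that alongside a guard on the label count. The label names in named_lbls are hypothetical, chosen only for illustration.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 99)
breaks <- c(-Inf, 2, 4, 6, 8, Inf)

# labels = FALSE returns the bin index (1 = first bin) as an integer vector.
codes <- cut(x, breaks = breaks, labels = FALSE)
codes

# Hypothetical labels; cut() needs exactly one label per bin.
named_lbls <- c("low", "mid-low", "mid-high", "high", "very high")
stopifnot(length(named_lbls) == length(breaks) - 1)

binned <- cut(x, breaks = breaks, labels = named_lbls)
table(binned)
```

Integer codes can be handy when the bins feed into further computation, while named labels are usually easier to read in tables and plots.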

answered Nov 15 '22 09:11

Markm0705
