I was looking for a clear explanation of the statement 'labels are constructed using "(a,b]" interval notation' in the cut help file, which does not elaborate on what the notation means.
Note that when giving breakpoints, the default in R is that the histogram cells are right-closed (left-open) intervals of the form (a,b]. You can change this with the right=FALSE option, which makes the intervals of the form [a,b). This matters if many of your points fall exactly on a breakpoint.
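As a quick check of the histogram behaviour, here is a minimal sketch. The vector x and the breaks are made-up illustration values, and hist is called with plot = FALSE only to read off the bin counts. Keep in mind that hist defaults to include.lowest = TRUE, so the outermost break is always included in its bin.

```r
# Made-up values: every point sits on or between the breaks 0, 2, 4.
x <- c(1, 2, 3, 4)
brks <- c(0, 2, 4)

# Default right = TRUE: bins are [0,2] and (2,4].
right_closed <- hist(x, breaks = brks, plot = FALSE)$counts

# right = FALSE: bins are [0,2) and [2,4].
left_closed <- hist(x, breaks = brks, right = FALSE, plot = FALSE)$counts

right_closed  # 2 2
left_closed   # 1 3
```

The only thing that changes between the two calls is which bin the value 2 lands in.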
So I tested cut on some simple examples as follows:
df <- data.frame(x = c(1,2,3,4,5,6,7,99))
df$cut <- cut(df[ ,1], breaks = c(2,4,6,8), right = TRUE)
df
# x cut
# 1 <NA>
# 2 <NA>
# 3 (2,4]
# 4 (2,4]
# 5 (4,6]
# 6 (4,6]
# 7 (6,8]
# 99 <NA>
So the '(' means x > break on the left and the ']' means x <= (next) break on the right. If a value is less than or equal to the lowest break it is flagged as NA; similarly, if a value exceeds the highest break it is also flagged as NA.
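To convince myself of that reading, the first bin can be rebuilt by hand with plain comparisons. This is only a sketch: the names a, b, in_bin and from_cut are mine, and it checks just the (2,4] bin.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 99)

# Hand-written membership test for the bin (a, b] with a = 2, b = 4.
a <- 2
b <- 4
in_bin <- x > a & x <= b

# The same bin as assigned by cut(): level 1 of the resulting factor.
from_cut <- cut(x, breaks = c(2, 4, 6, 8), right = TRUE)
in_first_level <- !is.na(from_cut) & as.integer(from_cut) == 1L

which(in_bin)                      # 3 4
identical(in_bin, in_first_level)  # TRUE
```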
Next, testing the option include.lowest = TRUE:
df$cut <- cut(df[ ,1], breaks = c(2,4,6,8), right = TRUE, include.lowest = TRUE)
df
# x cut
# 1 <NA>
# 2 [2,4]
# 3 [2,4]
# 4 [2,4]
# 5 (4,6]
# 6 (4,6]
# 7 (6,8]
# 99 <NA>
So here, for the first bin (between the first two breaks), the '[' on the left means x >= (first) break and the ']' means x <= (second) break. Subsequent bins are treated as above, and values outside the breaks are still flagged as NA.
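One thing the name include.lowest hides: according to the cut help file, when right = FALSE it is the highest break that gets included instead. A small sketch of that case follows; I added the value 8 to the test vector (it is not in the data frame above) so that a point sits exactly on the highest break.

```r
# Same breaks as above, plus an extra value 8 on the highest break.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 99)
brks <- c(2, 4, 6, 8)

# Left-closed bins; include.lowest = TRUE now closes the LAST bin on the right.
f <- cut(x, breaks = brks, right = FALSE, include.lowest = TRUE)
codes <- as.integer(f)

levels(f)  # "[2,4)" "[4,6)" "[6,8]"
codes      # NA 1 1 2 2 3 3 3 NA
```

Without include.lowest = TRUE, the value 8 would come back as NA here.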
Next, the NA values can be addressed by using -Inf and/or +Inf in the breaks, so that every value falls inside some bin:
df$cut <- cut(df[ ,1], breaks = c(-Inf,2,4,6,8,+Inf), right = TRUE, include.lowest = TRUE)
df
# x cut
# 1 [-Inf,2]
# 2 [-Inf,2]
# 3 (2,4]
# 4 (2,4]
# 5 (4,6]
# 6 (4,6]
# 7 (6,8]
# 99 (8, Inf]
Setting the right = FALSE option flips which side of each interval is closed: the bins become [a,b), so a value equal to a break now falls into the bin above it, as in the example below:
df$cut <- cut(df[ ,1], breaks = c(-Inf,2,4,6,8,+Inf), right = FALSE)
df
# x cut
# 1 [-Inf,2)
# 2 [2,4)
# 3 [2,4)
# 4 [4,6)
# 5 [4,6)
# 6 [6,8)
# 7 [6,8)
# 99 [8, Inf)
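The difference is easiest to see on values that sit exactly on a break. The sketch below (variable names are mine) compares the integer bin codes of both settings side by side for such values.

```r
# Only values that coincide with a break.
x <- c(2, 4, 6, 8)
brks <- c(-Inf, 2, 4, 6, 8, Inf)

# labels = FALSE returns the bin number instead of a factor.
bin_right_true  <- cut(x, breaks = brks, right = TRUE,  labels = FALSE)
bin_right_false <- cut(x, breaks = brks, right = FALSE, labels = FALSE)

data.frame(x, bin_right_true, bin_right_false)
# With right = TRUE each value closes its own bin (bins 1 to 4);
# with right = FALSE each value opens the next bin (bins 2 to 5).
```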
Finally, the labels option allows custom names for the bins, should you so wish. There must be one label per interval, i.e. one fewer than the number of breaks:
lbls <- c('x<=2','2<x<=4','4<x<=6','6<x<=8','x>8')
df$cut <- cut(df[ ,1], breaks = c(-Inf,2,4,6,8,+Inf), right = TRUE, include.lowest = TRUE, labels = lbls)
df
# x cut
# 1 x<=2
# 2 x<=2
# 3 2<x<=4
# 4 2<x<=4
# 5 4<x<=6
# 6 4<x<=6
# 7 6<x<=8
# 99 x>8
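If you only need the bin number rather than a name, the help file says labels = FALSE returns simple integer codes instead of a factor. A minimal sketch, reusing the same values and breaks as above (the names x, brks and codes are mine):

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 99)
brks <- c(-Inf, 2, 4, 6, 8, Inf)
lbls <- c('x<=2', '2<x<=4', '4<x<=6', '6<x<=8', 'x>8')

# One label per interval: length(brks) - 1 of them.
length(lbls) == length(brks) - 1  # TRUE

# Integer bin codes instead of a factor.
codes <- cut(x, breaks = brks, right = TRUE, labels = FALSE)
codes  # 1 1 2 2 3 3 4 5

# The codes can index the custom labels directly.
lbls[codes]
```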