How does cut with breaks work in R

Tags:

split

r

binning

I am trying to understand how cut divides data and creates intervals. I read ?cut but still can't figure out how cut works in R.
Here is my problem:

set.seed(111)
data1 <- seq(1,10, by=1)
data1 
[1]  1  2  3  4  5  6  7  8  9 10
data1cut<- cut(data1, breaks = c(0,1,2,3,5,7,8,10), labels = FALSE)
data1cut
[1] 1 2 3 4 4 5 5 6 7 7

1. Why are 8, 9 and 10 not included in the data1cut result?
2. Why do summary(data1) and summary(data1cut) produce different results?

summary(data1)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
1.00    3.25    5.50    5.50    7.75   10.00 

summary(data1cut)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
1.00    3.25    4.50    4.40    5.75    7.00

How should I use cut to create, say, 4 bins based on the results of summary(data1)?

bin1 [1 -3.25]
bin2 (3.25 -5.50]
bin3 (5.50 -7.75]
bin4 (7.75 -10]

Thank you.

asked Aug 24 '16 12:08

deepseefan

1 Answer

cut in your example splits the vector into the following bins: (0,1] (1); (1,2] (2); (2,3] (3); (3,5] (4); (5,7] (5); (7,8] (6); (8,10] (7)

The numbers in parentheses are the integer codes that cut assigns to each bin, numbered in the order of the breaks values provided.

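By default, cut builds intervals that are open on the left and closed on the right (right = TRUE), so each bin excludes its lower bound. A value equal to the lowest break therefore falls into no bin and becomes NA. Set include.lowest = TRUE to include the lowest break in the first bin, or right = FALSE to get intervals that are closed on the left instead.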
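To see how the interval boundaries behave, here is a small sketch using the same data and breaks as in the question. The variable names (default_bins, left_closed, no_lowest, with_lowest) are only illustrative.

```r
x <- 1:10
breaks <- c(0, 1, 2, 3, 5, 7, 8, 10)

# Default (right = TRUE): every interval is (lower, upper]
default_bins <- cut(x, breaks = breaks)
levels(default_bins)
# "(0,1]" "(1,2]" "(2,3]" "(3,5]" "(5,7]" "(7,8]" "(8,10]"

# right = FALSE: every interval is [lower, upper),
# so 10 is no longer covered and becomes NA
left_closed <- cut(x, breaks = breaks, right = FALSE)
left_closed[10]

# When the lowest break equals the smallest value, that value is dropped
# unless include.lowest = TRUE
quartile_breaks <- c(1, 3.25, 5.5, 7.75, 10)
no_lowest   <- cut(x, breaks = quartile_breaks)
with_lowest <- cut(x, breaks = quartile_breaks, include.lowest = TRUE)
no_lowest[1]    # NA
with_lowest[1]  # first bin
```

In your original call the lowest break is 0, which is below the smallest value, so nothing is lost there even without include.lowest.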

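1. You passed labels = FALSE, so cut returns an integer vector of bin codes (the numbers in parentheses above) instead of a factor with interval labels; the default, labels = NULL, would build labels such as (7,8]. The values 8, 9 and 10 are therefore included: 8 falls in bin 6, and 9 and 10 fall in bin 7.
2. summary(data1) summarises the raw data, while summary(data1cut) summarises the bin codes, which range from 1 to 7.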
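It may help to put the values, the codes and the interval labels side by side. This is just a sketch with the data from the question; codes, bins and lookup are names I made up for the illustration.

```r
data1 <- 1:10
breaks <- c(0, 1, 2, 3, 5, 7, 8, 10)

# labels = FALSE: integer bin codes, as in the question
codes <- cut(data1, breaks = breaks, labels = FALSE)

# default labels: a factor whose levels are the intervals
bins <- cut(data1, breaks = breaks)

# one row per value, showing which bin it landed in
lookup <- data.frame(value = data1, code = codes, bin = bins)
lookup

# number of values per bin, which is usually more useful than
# summary() of the codes
table(bins)
```

The code column is the same as as.integer(bins), so the two calls carry the same information in different forms.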

You can get the split you need using:

data2cut<- 
  cut(data1, breaks = c(1, 3.25, 5.50, 7.75, 10),
      labels = c("1-3.25", "3.25-5.50", "5.50-7.75", "7.75-10"),
      include.lowest = TRUE)

The result is the following:

> data2cut

 [1] 1-3.25    1-3.25    1-3.25    3.25-5.50 3.25-5.50 5.50-7.75 5.50-7.75 7.75-10   7.75-10  
[10] 7.75-10  
Levels: 1-3.25 3.25-5.50 5.50-7.75 7.75-10
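If you would rather not type the quartiles by hand, one possible variation is to take the breaks straight from quantile(), which for this data returns the same values that summary(data1) reports. This is a sketch rather than the only way to do it, and data3cut is just an illustrative name.

```r
data1 <- 1:10

# 0%, 25%, 50%, 75% and 100% quantiles: 1, 3.25, 5.5, 7.75, 10
q <- quantile(data1)

# include.lowest = TRUE keeps the minimum (1) in the first bin
data3cut <- cut(data1, breaks = q, include.lowest = TRUE)

levels(data3cut)
# "[1,3.25]" "(3.25,5.5]" "(5.5,7.75]" "(7.75,10]"

table(data3cut)
```

Note that with heavily tied data the quantiles may not be unique, in which case cut will complain that the breaks are not unique and you would need to deduplicate them first.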

I hope it's clear now.

answered Oct 14 '22 18:10

epo3
