I am trying to understand how cut divides a vector and creates intervals. I tried ?cut but can't figure out how cut works in R.
Here is my problem:
set.seed(111)
data1 <- seq(1,10, by=1)
data1
[1] 1 2 3 4 5 6 7 8 9 10
data1cut<- cut(data1, breaks = c(0,1,2,3,5,7,8,10), labels = FALSE)
data1cut
[1] 1 2 3 4 4 5 5 6 7 7
1. Why are 8, 9 and 10 not included in the data1cut result?
2. Why do summary(data1) and summary(data1cut) produce different results?
summary(data1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 3.25 5.50 5.50 7.75 10.00
summary(data1cut)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 3.25 4.50 4.40 5.75 7.00
How can I use cut to create, say, 4 bins based on the results of summary(data1)?
bin1 [1 -3.25]
bin2 (3.25 -5.50]
bin3 (5.50 -7.75]
bin4 (7.75 -10]
Thank you.
In your example, cut splits the vector into the following parts:
0-1 (1); 1-2 (2); 2-3 (3); 3-5 (4); 5-7 (5); 7-8 (6); 8-10 (7)
The numbers in brackets are the default labels assigned by cut to each bin, based on the breaks values provided.
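By default, cut builds intervals that are open on the left and closed on the right, i.e. (a, b], so each bin excludes its lower bound. If you want the first bin to include the lowest break as well, set include.lowest = TRUE; if you want every bin to be closed on the left instead, i.e. [a, b), set right = FALSE.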
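To make the boundary behaviour concrete, here is a small sketch. The three-value vector x and the breaks 1, 5, 10 are made up for illustration and are not part of your data:

```r
# Hypothetical toy input: every value sits exactly on a break
breaks <- c(1, 5, 10)
x <- c(1, 5, 10)

# Default: intervals are (1,5] and (5,10], so the value 1 falls outside
default_codes <- cut(x, breaks = breaks, labels = FALSE)

# include.lowest = TRUE closes the first interval: [1,5] and (5,10]
with_lowest <- cut(x, breaks = breaks, labels = FALSE, include.lowest = TRUE)

# right = FALSE flips the closed side: [1,5) and [5,10), so 10 falls outside
left_closed <- cut(x, breaks = breaks, labels = FALSE, right = FALSE)

print(rbind(default_codes, with_lowest, left_closed))
```

A value that falls outside every interval comes back as NA, which is usually the first sign that a boundary is open on the wrong side.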
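You passed labels = FALSE, so cut returns an integer vector of bin codes (the numbers in brackets above) instead of a factor. That is also why 8, 9 and 10 seem to be missing: data1cut lists the bin each value falls into (6, 7 and 7 for those three), not the values themselves.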
summary(data1) is a summary of the raw data, while summary(data1cut) is a summary of the integer bin codes, so the two are not expected to match.
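One way to see this for yourself is to put the values, the interval labels and the integer codes side by side. This is only a sketch; the names as_factor, as_codes and lookup are mine:

```r
data1 <- seq(1, 10, by = 1)
breaks <- c(0, 1, 2, 3, 5, 7, 8, 10)

# Default labels: a factor whose levels spell out each interval
as_factor <- cut(data1, breaks = breaks)

# labels = FALSE: only the integer code of each bin
as_codes <- cut(data1, breaks = breaks, labels = FALSE)

# Line them up to see which bin every value landed in
lookup <- data.frame(value = data1, interval = as_factor, code = as_codes)
print(lookup)

# Counting values per bin is usually more informative than summary() on codes
print(table(as_factor))
```

The code column is exactly what data1cut holds, so summary(data1cut) describes bin numbers from 1 to 7 rather than the data from 1 to 10.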
You can get the split you need using:
data2cut <- cut(data1,
                breaks = c(1, 3.25, 5.50, 7.75, 10),
                labels = c("1-3.25", "3.25-5.50", "5.50-7.75", "7.75-10"),
                include.lowest = TRUE)
The result is the following:
> data2cut
[1] 1-3.25 1-3.25 1-3.25 3.25-5.50 3.25-5.50 5.50-7.75 5.50-7.75 7.75-10 7.75-10
[10] 7.75-10
Levels: 1-3.25 3.25-5.50 5.50-7.75 7.75-10
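If you would rather not type the quartiles by hand, a possible variation is to let quantile() compute the breaks. With its default settings it returns the minimum, the three quartiles and the maximum, which for this data are the same numbers summary(data1) prints. The names qbreaks and data3cut are mine:

```r
data1 <- seq(1, 10, by = 1)

# Default probs are 0, 0.25, 0.5, 0.75, 1: min, quartiles, max
qbreaks <- quantile(data1)

# include.lowest = TRUE keeps the minimum inside the first bin
data3cut <- cut(data1, breaks = qbreaks, include.lowest = TRUE)

print(qbreaks)
print(table(data3cut))
```

This only works when the quantiles are all distinct; with heavily tied data cut will complain that the breaks are not unique.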
I hope it's clear now.