I have a vector that looks like this:
dataset <- c(4,7,9,1,10,15,18,19,3,16,10,16,12,22,2,23,16,17)
I would like to create four dummy categories, in which I bin the continuous dataset by custom breaks, for example: 1:4, 5:9, 10:17, 18:23.
The output dummy categories would have the same length as the original continuous vector (18 in this case), but now each binned dummy variable would just contain a 1 or a 0.
Use cut:
data.frame(dataset, bin=cut(dataset, c(1,4,9,17,23), include.lowest=TRUE))
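The question asks for 0/1 dummy variables rather than a single factor column. One possible way to get from the cut result to dummies, sketched here with base R's model.matrix (the names bin and dummies are only illustrative):

```r
dataset <- c(4,7,9,1,10,15,18,19,3,16,10,16,12,22,2,23,16,17)

# Bin as above: [1,4], (4,9], (9,17], (17,23]
bin <- cut(dataset, c(1,4,9,17,23), include.lowest=TRUE)

# One 0/1 column per factor level; "- 1" drops the intercept so that
# every level gets its own column instead of being folded into a baseline
dummies <- model.matrix(~ bin - 1)

# Each row contains exactly one 1, marking the bin that value fell into
head(dummies)
```

Each column has the same length as the original vector (18 here), and cbind(dataset, dummies) puts the dummies next to the raw values.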
I agree with Joshua that cut is what most people would think of for this task. I don't happen to like its defaults: I prefer left-closed intervals, and it's a minor pain to set that up correctly with cut (although it can be done). Fortunately for my feeble brain, Frank Harrell has designed a cut2 function in his Hmisc package whose defaults I prefer. A third alternative is findInterval, which is especially suited to problems where you want to use the result as an index into another extraction or selection process. Its results are roughly what you would get if you applied as.numeric to the results of cut:
require(Hmisc)
cut2(dataset, c(1,4,9,17,23) )
[1] [ 4, 9) [ 4, 9) [ 9,17) [ 1, 4) [ 9,17) [ 9,17) [17,23] [17,23] [ 1, 4) [ 9,17)
[11] [ 9,17) [ 9,17) [ 9,17) [17,23] [ 1, 4) [17,23] [ 9,17) [17,23]
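For comparison, left-closed intervals can also be had from plain cut by setting right=FALSE. This is only a sketch of that setup; the variable name left_closed is illustrative:

```r
dataset <- c(4,7,9,1,10,15,18,19,3,16,10,16,12,22,2,23,16,17)

# right=FALSE makes intervals closed on the left: [1,4), [4,9), [9,17), ...
# include.lowest=TRUE then closes the last interval on the right,
# so the maximum break (23) is still included
left_closed <- cut(dataset, c(1,4,9,17,23), right=FALSE, include.lowest=TRUE)

table(left_closed)
```

With these breaks the values are grouped the same way as in the cut2 output above: 4 lands in the second bin and 17 in the last.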
(Notice that findInterval treats the upper bound as the closed end of an extra interval, so a value equal to the maximum break gets its own index, unless you replace the maximum with Inf, a reserved word for infinity in R.)
findInterval(dataset, c(1,4,9,17,23))
[1] 2 2 3 1 3 3 4 4 1 3 3 3 3 4 1 5 3 4
as.numeric( cut(dataset, c(1,4,9,17,Inf), include.lowest=TRUE))
[1] 1 2 2 1 3 3 4 4 1 3 3 3 3 4 1 4 3 3
as.numeric( cut(dataset, c(1,4,9,17,23), include.lowest=TRUE))
[1] 1 2 2 1 3 3 4 4 1 3 3 3 3 4 1 4 3 3
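The use of findInterval as an index, mentioned above, might look like the following. The lower bounds c(1,5,10,18) are an assumption chosen to reproduce the integer bins from the question (1:4, 5:9, 10:17, 18:23), and bin_labels is a hypothetical lookup vector:

```r
dataset <- c(4,7,9,1,10,15,18,19,3,16,10,16,12,22,2,23,16,17)

# Lower bound of each bin; findInterval returns, for each value,
# how many of these bounds are less than or equal to it
lower <- c(1, 5, 10, 18)
idx <- findInterval(dataset, lower)

# Hypothetical labels, one per bin, selected by the integer index
bin_labels <- c("1-4", "5-9", "10-17", "18-23")
data.frame(dataset, bin = bin_labels[idx])
```

Because only lower bounds are supplied, anything at or above 18 falls in the last bin, so there is no extra interval for the maximum value.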