
How do I create binned factor variables from a continuous variable, with custom breaks?

Tags:

r

I have a vector that looks like this:

dataset <- c(4,7,9,1,10,15,18,19,3,16,10,16,12,22,2,23,16,17)

I would like to create four dummy categories by binning the continuous dataset at custom breaks, for example: 1:4, 5:9, 10:17, 18:23.

The output dummy categories would have the same length as the original continuous vector (18 in this case), but now each binned dummy variable would just contain a 1 or a 0.

Luke asked Sep 10 '12 14:09



2 Answers

Use cut:

data.frame(dataset, bin=cut(dataset, c(1,4,9,17,23), include.lowest=TRUE))
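The question asks for 0/1 columns rather than a single factor. One possible way to get them from the cut result is base R's model.matrix; this is only a sketch, and the names bin and dummies are illustrative:

```r
dataset <- c(4,7,9,1,10,15,18,19,3,16,10,16,12,22,2,23,16,17)
bin <- cut(dataset, c(1,4,9,17,23), include.lowest=TRUE)

# One indicator column per factor level; "- 1" drops the intercept
# so that all four bins get their own 0/1 column.
dummies <- model.matrix(~ bin - 1)
head(dummies)
```

Each row of dummies has a single 1, in the column of the bin that the value falls into.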
Joshua Ulrich answered Oct 27 '22 23:10



I agree with Joshua that cut is what most people would think of for this task. I don't happen to like its defaults: I prefer left-closed intervals, and it's a minor pain to set that up correctly with cut (although it can be done). Fortunately for my feeble brain, Frank Harrell has designed a cut2 function in his Hmisc package whose defaults I prefer. A third alternative is findInterval, which is especially suited to problems where you want to use the result as an index into another extraction or selection process. Its results are roughly what you would get if you applied as.numeric to the results of cut:

library(Hmisc)
cut2(dataset, c(1,4,9,17,23) )
 [1] [ 4, 9) [ 4, 9) [ 9,17) [ 1, 4) [ 9,17) [ 9,17) [17,23] [17,23] [ 1, 4) [ 9,17)
[11] [ 9,17) [ 9,17) [ 9,17) [17,23] [ 1, 4) [17,23] [ 9,17) [17,23]
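For comparison, the left-closed setup with base cut might look like the following. This is a minimal sketch using the same breaks; right and include.lowest are standard cut arguments, and the names breaks and left_closed are only illustrative:

```r
dataset <- c(4,7,9,1,10,15,18,19,3,16,10,16,12,22,2,23,16,17)
breaks <- c(1,4,9,17,23)

# right=FALSE closes each interval on the left. Combined with right=FALSE,
# include.lowest=TRUE closes the last interval on the right, so 23 is not NA.
left_closed <- cut(dataset, breaks, right=FALSE, include.lowest=TRUE)
table(left_closed)
```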

(Notice that findInterval treats the last break as the closed lower end of an extra interval, so the maximum value, 23, lands in a fifth bin unless you replace the maximum break with Inf, a reserved word for infinity in R.)

findInterval(dataset, c(1,4,9,17,23))
 [1] 2 2 3 1 3 3 4 4 1 3 3 3 3 4 1 5 3 4
as.numeric( cut(dataset, c(1,4,9,17,Inf), include.lowest=TRUE))
 [1] 1 2 2 1 3 3 4 4 1 3 3 3 3 4 1 4 3 3
as.numeric( cut(dataset, c(1,4,9,17,23), include.lowest=TRUE))
 [1] 1 2 2 1 3 3 4 4 1 3 3 3 3 4 1 4 3 3
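To illustrate the point about using findInterval as an index, here is a small sketch. It assumes integer data, so the bins 1:4, 5:9, 10:17, 18:23 can be described by their lower bounds alone; lower_bounds, bin_labels and idx are made-up names for the example:

```r
dataset <- c(4,7,9,1,10,15,18,19,3,16,10,16,12,22,2,23,16,17)

# Lower bounds of the bins 1:4, 5:9, 10:17, 18:23 (assumes integer data)
lower_bounds <- c(1,5,10,18)
bin_labels <- c("1-4","5-9","10-17","18-23")   # illustrative labels

# findInterval returns an integer bin index, here 1 to 4,
# which can index directly into any lookup vector.
idx <- findInterval(dataset, lower_bounds)
data.frame(dataset, bin=bin_labels[idx])
```

With these lower bounds the index agrees with the as.numeric(cut(...)) results above, including for the maximum value 23.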
IRTFM answered Oct 27 '22 23:10
