
R Hist : relationship between 'breaks' value and number/size of bins

Tags: r, histogram

Regarding the hist() function in R, can anyone help me find:

a very simple definition explaining the relationship between the specified value of 'breaks' and the number of bins produced in the histogram?

For example, I use the basic data set provided with the R tool:

data(mtcars)
hist(mtcars$mpg, breaks=3)  # will draw 3 bins (really??? weird!)
hist(mtcars$mpg, breaks=4)  # will draw 5 bins
hist(mtcars$mpg, breaks=5)  # will draw 5 bins, no change, same as breaks=4
hist(mtcars$mpg, breaks=6)  # will draw 5 bins, no change, same as breaks=4
hist(mtcars$mpg, breaks=7)  # will draw 5 bins, no change, same as breaks=4
hist(mtcars$mpg, breaks=8)  # will draw 5 bins, no change, same as breaks=4
hist(mtcars$mpg, breaks=9)  # will draw 11 bins (why???)

Why would breaks = 4, 5, 6, 7, 8 all lead to the same number of bins, while breaks=3 leads to just 3 bins?

The R documentation, which you can find at ?hist or at the following link: http://localhost//library/graphics/html/hist.html, did not really help. I tried to establish a link between the value specified in "breaks=", the size of the bins and the number of bins, but I could not find a simple formula or explanation from which to deduce such a link.

I just do not understand what "breaks=3" means. Does it mean "3 breaks", "a break every 3 units", or something completely different?

I would really appreciate any hint, help, pointers of any sort.

Thank you.

MMEL asked May 02 '18 00:05


1 Answer

The documentation for hist says that when you specify breaks as a single number (as you did) then

the number is a suggestion only; as the breakpoints will be set to pretty values

If you follow the link to the documentation for pretty it says

The values are chosen so that they are 1, 2 or 5 times a power of 10.

You cannot span the gap between 10 and 35 with 4 evenly spaced bins whose boundaries are multiples of 1, 2, 5 or 10, so hist chose 5 bins (6 break points). If you really want four evenly spaced bins, you could use

hist(mtcars$mpg, seq(10,35, length.out=5))

[Figure: histogram with 4 bins]

Note that you need to use length.out=5 to get four bins (four starting points plus one extra endpoint). Of course, this does not give the "pretty" values.
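The "n bins need n + 1 boundaries" rule can be wrapped in a small helper. This is only a sketch: equal_width_breaks is a hypothetical function of my own, not part of base R, and it spans the observed range of the data (min to max) rather than the rounded 10 to 35 used above.

```r
# Hypothetical helper (not part of base R): boundaries for exactly
# n_bins equal-width bins spanning the observed range of x.
equal_width_breaks <- function(x, n_bins) {
  # n_bins bins need n_bins + 1 boundaries
  seq(min(x), max(x), length.out = n_bins + 1)
}

b <- equal_width_breaks(mtcars$mpg, 4)
h <- hist(mtcars$mpg, breaks = b, plot = FALSE)

length(b)         # number of boundaries: n_bins + 1
length(h$counts)  # number of bins hist actually used
```

Because a vector of breakpoints is taken literally, hist has no chance to replace these boundaries with pretty ones.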

If you don't like that the ticks on the x-axis don't line up with the bins (I don't), you can leave off the axes in hist and add them yourself.

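# Draw the histogram without axes, then add axes by hand
H = hist(mtcars$mpg, seq(10,35, length.out=5), axes=FALSE, ylim=c(0,14))
axis(side=1, at=seq(10,35, length.out=5))  # x-axis ticks at the bin boundaries
axis(side=2, at=pretty(0:14))              # y-axis ticks at pretty values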

[Figure: Histogram 2]

Further Explanation of breaks

The documentation ?hist says under breaks that there are 5 types of values that you can use for breaks. The one you are using is:

a single number giving the number of cells for the histogram

BUT as noted above, the documentation adds:

the number is a suggestion only; the breakpoints will be set to pretty values.

So when you give hist the argument breaks=4, it knows you want 4 bins, but it will also insist on using "pretty" values for the boundaries, that is, evenly spaced values whose step is 1, 2 or 5 times a power of ten. There may also be constraints on the endpoints.

Let's investigate what it does with your mtcars$mpg data. You can get a lot of information about what hist is doing by saving the return value. I will also suppress the actual plotting of the histogram since right now I am only interested in the value.

HV = hist(mtcars$mpg, 4, plot=FALSE)

You can print out HV and see that there is a lot of information about the histogram. All we care about here is stored in breaks.

HV$breaks
[1] 10 15 20 25 30 35

This is giving the 6 boundary values for the bins (5 bins need 6 boundary values). But we asked for 4 bins, not 5! If you split the range 10-35 into four bins you get the boundaries 10, 16.25, 22.5, 28.75 and 35. These are not "pretty" boundary values. Instead, hist uses the pretty function to find nicer values for the boundaries, but that means it has to give up using 4 bins.
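You can see where those boundaries come from by calling pretty yourself. The sketch below rests on an assumption that comes from my reading of the hist.default source, not from anything stated in ?hist: that for a single-number breaks, hist passes the range of the data and the requested number of cells to pretty with min.n = 1. Check it against your own R version before relying on it.

```r
# Assumption: for a single-number `breaks`, hist.default computes its
# breakpoints roughly as pretty(range(x), n = breaks, min.n = 1).
x <- mtcars$mpg
requested <- 4

from_pretty <- pretty(range(x), n = requested, min.n = 1)
from_hist   <- hist(x, breaks = requested, plot = FALSE)$breaks

from_pretty                                  # boundaries suggested by pretty()
same <- isTRUE(all.equal(from_pretty, from_hist))
same                                         # TRUE if hist used the same boundaries
```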

How many break points do we get for a range of values of breaks? Let's try 2 breaks up to 20 breaks.

sapply(2:20, function(n) 
    length(hist(mtcars$mpg, n, plot=FALSE)$breaks))
 [1]  4  4  6  6  6  6  6 13 13 13 13 13 13 13 13 25 25 25 25
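Each of these lengths is one more than the number of bins. That may also explain the 11 bars reported in the question for breaks=9: my guess is that one of the bins is empty, so it exists but draws no visible bar. The sketch below compares the number of bins with the number of non-empty bins; the variable names are mine.

```r
h9 <- hist(mtcars$mpg, breaks = 9, plot = FALSE)

n_bins     <- length(h9$breaks) - 1  # bins = boundaries - 1
n_nonempty <- sum(h9$counts > 0)     # bins that would show a visible bar

n_bins
n_nonempty
h9$counts  # a 0 here marks an empty bin
```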

Note again: 4 break points means 3 bins. 6 break points means 5 bins. There are only four different splits that are created. What are they?

unique(lapply(2:20, function(n) hist(mtcars$mpg, n, plot=FALSE)$breaks))
[[1]]
[1] 10 20 30 40
[[2]]
[1] 10 15 20 25 30 35
[[3]]
 [1] 10 12 14 16 18 20 22 24 26 28 30 32 34
[[4]]
 [1] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34

The boundaries change in steps of 10, 5, 2 or 1: pretty boundaries.
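If you would rather not read the step sizes off the printed vectors, you can compute them. This is a small sketch using diff on each of the distinct splits; splits, bin_widths and evenly_spaced are names I made up for the illustration.

```r
# The distinct sets of breakpoints produced for breaks = 2, ..., 20
splits <- unique(lapply(2:20, function(n) hist(mtcars$mpg, n, plot = FALSE)$breaks))

# Width of the first bin in each split
bin_widths <- sapply(splits, function(b) diff(b)[1])

# Check that every bin within a split has that same width
evenly_spaced <- sapply(splits, function(b) {
  isTRUE(all.equal(diff(b), rep(diff(b)[1], length(b) - 1)))
})

bin_widths
evenly_spaced
```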

If you want to have more control, you need to be able to specify where you want the boundaries. That is what I did in the example above. One of the other options for specifying breaks is:

a vector giving the breakpoints between histogram cells

That is what I used when I specified seq(10,35, length.out=5). But notice the values:

seq(10,35, length.out=5)
[1] 10.00 16.25 22.50 28.75 35.00

Not pretty.

So you can have it easy and pretty, but without good control over the number of bins OR you can have control over the number of bins at the cost of more work and uglier boundaries.
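One possible middle ground, offered only as a sketch, is to fix the bin width yourself and round the data range outward to multiples of that width. You then control the width and get round boundaries, but the number of bins follows from the data. The width of 4 below is an arbitrary choice for illustration.

```r
x <- mtcars$mpg
bin_width <- 4  # assumed width, chosen only for illustration

# Round the data range outward to multiples of bin_width so every value is covered
lower <- floor(min(x) / bin_width) * bin_width
upper <- ceiling(max(x) / bin_width) * bin_width
b <- seq(lower, upper, by = bin_width)

h <- hist(x, breaks = b, plot = FALSE)

b                 # boundaries, all multiples of bin_width
length(h$counts)  # resulting number of bins
```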

G5W answered Oct 04 '22 16:10