
What does the right parameter do when creating a histogram in R?

Tags: r, histogram

I am trying to figure out what the right parameter in the hist function in R does. The documentation is unfortunately unclear to someone without a deep understanding of statistics such as myself.

The documentation as stated online is:

right logical; if TRUE, the histogram cells are right-closed (left open) intervals.

What does it mean for intervals to be right-closed (or left-open)?

Ryan Taylor asked Dec 27 '11 18:12


2 Answers

When creating histograms of non-categorical data (things like pH, temperature, etc.), you need to specify things called "bins". Each bin covers an interval of values. For example, if I have the data:

11  12  13  14  15  16  17  18  19

I can create 5 bins with right-open, left-closed intervals like this:

1st bin: [10, 12)
2nd bin: [12, 14)
3rd bin: [14, 16)
4th bin: [16, 18)
5th bin: [18, 20)

What this means is that the first bin will "hold" values between 10 and 12, including 10 but not including 12. The interval notation used above is shorthand for this:

1st bin: 10 ≤ x < 12
2nd bin: 12 ≤ x < 14
3rd bin: 14 ≤ x < 16
4th bin: 16 ≤ x < 18
5th bin: 18 ≤ x < 20

So the value 11 will go into the 1st bin, but the value 12 will go into the 2nd bin, and so on. R will do this binning for you and then draw the histogram based on how many items are in each bin. For the above data, you'll get a rather uninteresting (or interesting, depending on your expectations) histogram that is flat except at the first bin, which holds 1 data point instead of 2.
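As a quick check of those counts, here is a minimal sketch in R. It assumes break points at 10, 12, ..., 20 to match the bins listed above, and uses plot = FALSE so that hist() returns the counts instead of drawing.

```r
# Data from the example above
x <- 11:19

# Break points 10, 12, ..., 20 give the five bins listed above
breaks <- seq(10, 20, by = 2)

# right = FALSE: bins are left-closed, right-open, i.e. [a, b)
h_left_closed <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)

h_left_closed$counts  # 1 2 2 2 2
```

Dropping plot = FALSE draws the same histogram instead of only computing it.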

The following examples illustrate what the different combinations of brackets and parentheses mean when using interval notation (assume x is an element of the real number line):

(1, 4) --> 1 < x < 4    left-open, right-open
[3, 7) --> 3 ≤ x < 7    left-closed, right-open
(2, 9] --> 2 < x ≤ 9    left-open, right-closed
[5, 6] --> 5 ≤ x ≤ 6    left-closed, right-closed
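The same four intervals can be written as plain comparisons in R. This is only an illustrative sketch; the test values in x and the variable names are made up for the example.

```r
# Hypothetical test values
x <- c(1, 3, 5, 7, 9)

in_open         <- x > 1  & x < 4    # (1, 4): both ends excluded
in_left_closed  <- x >= 3 & x < 7    # [3, 7): left end included
in_right_closed <- x > 2  & x <= 9   # (2, 9]: right end included
in_closed       <- x >= 5 & x <= 6   # [5, 6]: both ends included

x[in_open]          # 3
x[in_left_closed]   # 3 5
x[in_right_closed]  # 3 5 7 9
x[in_closed]        # 5
```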

Note that you can't use a square bracket next to an infinity, assuming you're not using the extended real number line:

(-∞, ∞)   -->   -∞ < x < ∞ 
(-∞, 20]  -->   -∞ < x ≤ 20 
[20, ∞)   -->   20 ≤ x < ∞
(1000, ∞) --> 1000 < x < ∞
(-∞, ∞]   -->   Invalid
(41, ∞]   -->   Invalid

If I want left-open, right-closed intervals, then the bins would look like this:

1st bin: (10, 12] i.e. 10 < x ≤ 12
2nd bin: (12, 14]      12 < x ≤ 14
3rd bin: (14, 16]      14 < x ≤ 16
4th bin: (16, 18]      16 < x ≤ 18
5th bin: (18, 20]      18 < x ≤ 20

See the difference? In this case, the values 11 and 12 both go into the first bin. Depending on how you bin the data, this can change the appearance of the histogram. This time your histogram is still almost flat, but now it is the 5th bin that differs from the rest (only 1 data point instead of 2).
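A minimal sketch of the right-closed case in R, again assuming break points at 10, 12, ..., 20 and using plot = FALSE to get the counts without drawing:

```r
# Data from the example above
x <- 11:19

# Break points 10, 12, ..., 20
breaks <- seq(10, 20, by = 2)

# right = TRUE (the default): bins are left-open, right-closed, i.e. (a, b]
h_right_closed <- hist(x, breaks = breaks, right = TRUE, plot = FALSE)

h_right_closed$counts  # 2 2 2 2 1
```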

Fortunately, in R you don't have to specify the bins yourself, but R does let you choose whether the bins are left-closed, right-open ([a, b)) or left-open, right-closed ((a, b]). That choice is what the "right" parameter of the hist() function controls.

In silico answered Nov 23 '22 19:11


The default is right = TRUE which gives intervals of the form (a, b]. Let's take an example to see what this means. Let's say that our data has the value 5 in it. Let's also say that the histogram is using break points of 3, 4, 5, 6. The question is which interval should our value 5 fall into? If we use right = TRUE the actual intervals that get used are (3, 4], (4, 5], (5, 6]. The interval notation (4, 5] means that it includes all the values between 4 and 5 - it doesn't include the actual value 4 but it does include the value 5. So our data point of 5 falls into this interval.

If instead we used right = FALSE the intervals would have the form [a, b) so with the same breakpoints of 3, 4, 5, 6 we would have the intervals [3, 4), [4, 5), [5, 6). This time our data point goes into the interval [5, 6) because this interval contains 5 whereas [4, 5) does not contain 5.

Essentially the 'right' parameter tells R what to do when a data point falls exactly where a breakpoint is located.
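To see this with the break points 3, 4, 5, 6 from the example, here is a small sketch. It uses cut(), a separate base R function that follows the same `right` convention, only to print the interval label that the value 5 lands in. The data vector x is made up for illustration.

```r
breaks <- c(3, 4, 5, 6)

# Which interval does the value 5 fall into?
label_right_true  <- as.character(cut(5, breaks = breaks, right = TRUE))
label_right_false <- as.character(cut(5, breaks = breaks, right = FALSE))

label_right_true   # "(4,5]"
label_right_false  # "[5,6)"

# Hypothetical data with two points sitting exactly on the break at 5
x <- c(3.5, 5, 5, 5.5)

counts_right_true  <- hist(x, breaks = breaks, right = TRUE,  plot = FALSE)$counts
counts_right_false <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)$counts

counts_right_true   # 1 2 1
counts_right_false  # 1 0 3
```

One detail worth knowing: hist() also has an include.lowest argument, TRUE by default, so a value equal to the outermost break (3 when right = TRUE, 6 when right = FALSE) is still counted in the end bin rather than dropped.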

Dason answered Nov 23 '22 17:11