In ggplot2, what do the end of the boxplot lines represent?

Tags: r, ggplot2, boxplot

I can't find a description of what the end points of the lines of a boxplot represent.

For example, in this plot there are points above and below where the lines end. [image]

(I realize that the bottom and top of the box are the 25th and 75th percentiles, and that the centerline is the 50th.) Since there are points above and below the lines, I assume the line ends do not represent the max/min values.

djq asked Feb 09 '11 15:02



2 Answers

The "dots" at the end of the boxplot represent outliers. There are a number of different rules for determining if a point is an outlier, but the method that R and ggplot use is the "1.5 rule". If a data point is:

  • less than Q1 - 1.5*IQR, or
  • greater than Q3 + 1.5*IQR

then that point is classed as an "outlier". The whiskers are defined as:

upper whisker = max{x : x <= Q3 + 1.5 * IQR}

lower whisker = min{x : x >= Q1 - 1.5 * IQR}

where IQR = Q3 - Q1, the box length. So the upper whisker ends at the largest data value that is no more than Q3 + 1.5 * IQR, whereas the lower whisker ends at the smallest data value that is no less than Q1 - 1.5 * IQR. If there are no outliers on a given side, the whisker simply ends at the maximum (or minimum) of the data.
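As a rough illustration of the 1.5 rule, the sketch below computes the fences, whisker ends, and outliers by hand. It takes the whisker end to be the most extreme observation that still lies inside the fence (the wording used in ?boxplot.stats), and it uses quantile() with its default type for the quartiles. The variable names are made up for this example.

```r
set.seed(1)
x <- rlnorm(20, 1/2)  # skewed sample

# Quartiles and box length (default quantile definition, type 7)
q <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Fences: the cut-offs used by the 1.5 rule
upper_fence <- q[2] + 1.5 * iqr
lower_fence <- q[1] - 1.5 * iqr

# Whisker ends: most extreme observations that are still inside the fences
upper_whisker <- max(x[x <= upper_fence])
lower_whisker <- min(x[x >= lower_fence])

# Anything beyond the fences would be drawn as an individual point
outliers <- x[x < lower_fence | x > upper_fence]

print(c(lower = lower_whisker, upper = upper_whisker))
print(outliers)
```

Note that the whisker ends are always actual data values, so they usually sit somewhat inside the fences rather than exactly on them.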

Additional information

  • See the Wikipedia boxplot page for alternative outlier rules.
  • There are actually a variety of ways of calculating quantiles. Have a look at `?quantile` for the description of the nine different methods.
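To see why the quantile definition matters, here is a small sketch on a made-up ten-point sample. It compares the lower quartile under each of the nine types in ?quantile with the hinges from fivenum(), which is what boxplot.stats() uses for the box. As far as I know, ggplot2 computes its box from quantile() instead, so the two can differ slightly on small samples.

```r
# Hypothetical small sample with one large value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

# Lower quartile under each of the nine definitions in ?quantile
q1_by_type <- sapply(1:9, function(type_id) {
  quantile(x, 0.25, type = type_id, names = FALSE)
})
print(round(q1_by_type, 3))

# Hinges used by boxplot() and boxplot.stats()
hinges <- fivenum(x)[c(2, 4)]
print(hinges)

# Quartiles from quantile() with its default type
default_quartiles <- quantile(x, c(0.25, 0.75), names = FALSE)
print(default_quartiles)
```

With larger samples these differences shrink, which is why they rarely show up in practice.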

Example

Consider the following example:

    set.seed(1)
    x = rlnorm(20, 1/2)  # skewed data
    par(mfrow = c(1, 3))
    boxplot(x, range = 1.7, main = "range=1.7")
    boxplot(x, range = 1.5, main = "range=1.5")  # default
    boxplot(x, range = 0, main = "range=0")  # the same as range = "very big number"

This gives the following plot: [image: three boxplots titled range=1.7, range=1.5 and range=0]

As we decrease range from 1.7 to 1.5, the whiskers can only get shorter. However, range=0 is a special case: it is equivalent to "range=infinity", so the whiskers extend all the way to the data extremes.
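The same effect can be checked numerically without plotting. The sketch below assumes that boxplot() hands its range argument to boxplot.stats() as coef, and compares the default rule with the coef = 0 special case.

```r
set.seed(1)
x <- rlnorm(20, 1/2)

s15 <- boxplot.stats(x, coef = 1.5)  # default 1.5 rule
s0 <- boxplot.stats(x, coef = 0)     # special case: no outlier rule applied

# Elements 1 and 5 of $stats are the lower and upper whisker ends
print(s15$stats[c(1, 5)])
print(s15$out)             # points beyond the whiskers under the 1.5 rule

print(s0$stats[c(1, 5)])   # whisker ends with coef = 0
print(range(x))            # data extremes, for comparison
print(s0$out)              # empty: nothing is flagged as an outlier
```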

csgillespie answered Sep 21 '22 15:09


I think ggplot uses the standard defaults, the same as boxplot: "the whiskers extend to the most extreme data point which is no more than [1.5] times the length of the box away from the box".

See: ?boxplot.stats
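A quick way to check the quoted definition is to call boxplot.stats() directly and compare its whisker ends with the most extreme points lying within 1.5 box lengths of the box. This is only a sketch on simulated data; box_length and inside are names invented for the example.

```r
set.seed(1)
x <- rlnorm(20, 1/2)

s <- boxplot.stats(x)  # default coef = 1.5
# s$stats: lower whisker, lower hinge, median, upper hinge, upper whisker
# s$out:   points lying beyond the whiskers

box_length <- s$stats[4] - s$stats[2]

# Points no more than 1.5 box lengths away from the box
inside <- x[x >= s$stats[2] - 1.5 * box_length &
            x <= s$stats[4] + 1.5 * box_length]

print(s$stats)
print(range(inside))  # should match the first and last elements of s$stats
print(s$out)
```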

Tyler answered Sep 21 '22 15:09