How can I label the points of a quantile-quantile plot composed with ggplot2?

Tags:

r

ggplot2

I am building a quantile-quantile plot from a variable called x in a data frame called df, as in the working example provided below. I would like to label the points with the name variable of my df dataset.

Is it possible to do this in ggplot2 without resorting to the painful solution (coding the theoretical distribution by hand and then plotting it against the empirical one)?

Edit: it turns out that yes, it is possible, thanks to a user who posted and then deleted his answer. See the comments after Arun's answer below. Thanks to Didzis for his otherwise clever solution with ggplot_build().

# MWE
df <- structure(list(name = structure(c(1L, 2L, 3L, 4L, 5L, 7L, 9L, 
10L, 6L, 12L, 13L, 14L, 15L, 16L, 17L, 19L, 18L, 20L, 21L, 22L, 
8L, 23L, 11L, 24L), .Label = c("AUS", "AUT", "BEL", "CAN", "CYP", 
"DEU", "DNK", "ESP", "FIN", "FRA", "GBR", "GRC", "IRL", "ITA", 
"JPN", "MLT", "NLD", "NOR", "NZL", "PRT", "SVK", "SVN", "SWE", 
"USA"), class = "factor"), x = c(-0.739390016757746, 0.358177826874146, 
1.10474523846099, -0.250589535389937, -0.423112615445571, -0.862144579740376, 
0.823039669834058, 0.079521521937704, 1.08173649722493, -2.03962942823921, 
1.05571087029737, 0.187147291278723, -0.144770773941437, 0.957990771847331, 
-0.0546549555439176, -2.70142550075757, -0.391588386498849, -0.23855544527369, 
-0.242781575907386, -0.176765072121165, 0.105155860923456, 2.69031085872414, 
-0.158320176671995, -0.564560815972446)), .Names = c("name", 
"x"), row.names = c(NA, -24L), class = "data.frame")

library(ggplot2)
qplot(sample = x, data = df) + geom_abline(linetype = "dotted") + theme_bw()

# ... using names instead of points would make it easier to spot the outliers

I am working on an adaptation of this gist, and will consider sending other questions to CrossValidated if I have questions about the regression diagnostics, which might be of interest to CV users.

Fr. asked Feb 19 '13 13:02


2 Answers

You can save your original Q-Q plot as an object (using ggplot() and stat_qq() instead of qplot()):

g<-ggplot(df, aes(sample = x)) + stat_qq()

Then use ggplot_build() to extract the data used for plotting. They are stored in the element data[[1]]. Save them as a new data frame.

df.new<-ggplot_build(g)$data[[1]]
head(df.new)
           x          y     sample theoretical PANEL group
1 -2.0368341 -2.7014255 -2.7014255  -2.0368341     1     1
2 -1.5341205 -2.0396294 -2.0396294  -1.5341205     1     1
3 -1.2581616 -0.8621446 -0.8621446  -1.2581616     1     1
4 -1.0544725 -0.7393900 -0.7393900  -1.0544725     1     1
5 -0.8871466 -0.5645608 -0.5645608  -0.8871466     1     1
6 -0.7415940 -0.4231126 -0.4231126  -0.7415940     1     1

Now you can add the observation names to the new data frame. It is important to use order(), because the rows of the new data frame are sorted by sample value.

df.new$name<-df$name[order(df$x)]
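The alignment step above can be checked on a small example. The following is a minimal sketch, assuming a hypothetical toy data frame (`toy`, with made-up names and values) in place of the question's df, and assuming the built data has a `sample` column as in the output shown above:

```r
library(ggplot2)

# Hypothetical toy data standing in for the question's df
toy <- data.frame(
  name = c("A", "B", "C", "D", "E"),
  x    = c(0.3, -1.2, 2.1, -0.4, 0.9)
)

g_toy   <- ggplot(toy, aes(sample = x)) + stat_qq()
toy_new <- ggplot_build(g_toy)$data[[1]]

# stat_qq() returns the sample values sorted in increasing order,
# so the names have to be reordered the same way
toy_new$name <- toy$name[order(toy$x)]

# Each label should now sit in the same row as its own value
stopifnot(isTRUE(all.equal(
  toy_new$sample,
  toy$x[match(toy_new$name, toy$name)]
)))
```

If the names were attached without order(), the check would fail for any data that is not already sorted.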

Now plot the new data frame as usual, but use geom_text() instead of geom_point().

ggplot(df.new,aes(theoretical,sample,label=name))+geom_text()+ 
  geom_abline(linetype = "dotted") + theme_bw()


Didzis Elferts answered Oct 10 '22 05:10


The points are too close together. I would do something like this:

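# Sort by x so each row lines up with its (simulated, approximate) normal quantile
df <- df[with(df, order(x)), ]
df$t <- quantile(rnorm(1000), seq(0, 100, length.out = nrow(df))/100)

# Map colour to the column itself rather than df$name inside aes()
p <- ggplot(data = df, aes(x=t, y=x)) + geom_point(aes(colour=name))
p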

This gives a plot in which each point is coloured by its name.

If you insist on having labels inside the plot, you could try something like:

df <- df[with(df, order(x)), ]
df$t <- quantile(rnorm(1000), seq(0, 100, length.out = nrow(df))/100)

p <- ggplot(data = df, aes(x=t, y=x)) + geom_point(aes(colour=name))
# Offset the labels slightly; the constant size goes outside aes() so it adds no legend
p <- p + geom_text(aes(x=t-0.05, y=x-0.15, label=name, colour=name), size=3)

p


You can adjust the x and y offsets of the labels, and you can always remove the colour aesthetic if you do not want it.
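The theoretical quantiles above come from a simulated sample, so they change from run to run. As a possible alternative (not part of the answer as posted), they can be computed exactly with qnorm(ppoints(n)), which is also what base R's qqnorm() uses. A minimal sketch, assuming a hypothetical toy data frame (`toy`, with made-up names and values) in place of df:

```r
library(ggplot2)

# Hypothetical toy data standing in for the question's df
toy <- data.frame(
  name = c("A", "B", "C", "D", "E"),
  x    = c(0.3, -1.2, 2.1, -0.4, 0.9)
)

# Sort by the sample values, then attach exact normal quantiles
toy <- toy[order(toy$x), ]
toy$theoretical <- qnorm(ppoints(nrow(toy)))

# Plot the names in place of points
p <- ggplot(toy, aes(x = theoretical, y = x, label = name)) +
  geom_text() +
  geom_abline(linetype = "dotted") +
  theme_bw()
p
```

Because nothing is simulated, the plot is identical on every run and needs no seed.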

Arun answered Oct 10 '22 04:10