Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: plotting posterior classification probabilities of a linear discriminant analysis in ggplot2

Tags:

Using ggord one can make nice linear discriminant analysis ggplot2 biplots (cf chapter 11, Fig 11.5 in "Biplots in practice" by M. Greenacre), as in

library(MASS) install.packages("devtools") library(devtools) install_github("fawda123/ggord") library(ggord) data(iris) ord <- lda(Species ~ ., iris, prior = rep(1, 3)/3) ggord(ord, iris$Species) 

enter image description here

I would also like to add the classification regions (shown as solid regions of the same colour as their respective group with say alpha=0.5) or the posterior probabilities of class membership (with alpha then varying according to this posterior probability and the same colour as used for each group) (as can be done in BiplotGUI, but I am looking for a ggplot2 solution). Would anyone know how to do this with ggplot2, perhaps using geom_tile ?

EDIT: below someone asks how to calculate the posterior classification probabilities & predicted classes. This goes like this:

library(MASS) library(ggplot2) library(scales) fit <- lda(Species ~ ., data = iris, prior = rep(1, 3)/3) datPred <- data.frame(Species=predict(fit)$class,predict(fit)$x) #Create decision boundaries fit2 <- lda(Species ~ LD1 + LD2, data=datPred, prior = rep(1, 3)/3) ld1lim <- expand_range(c(min(datPred$LD1),max(datPred$LD1)),mul=0.05) ld2lim <- expand_range(c(min(datPred$LD2),max(datPred$LD2)),mul=0.05) ld1 <- seq(ld1lim[[1]], ld1lim[[2]], length.out=300) ld2 <- seq(ld2lim[[1]], ld1lim[[2]], length.out=300) newdat <- expand.grid(list(LD1=ld1,LD2=ld2)) preds <-predict(fit2,newdata=newdat) predclass <- preds$class postprob <- preds$posterior df <- data.frame(x=newdat$LD1, y=newdat$LD2, class=predclass) df$classnum <- as.numeric(df$class) df <- cbind(df,postprob) head(df)             x        y     class classnum       setosa   versicolor virginica 1 -10.122541 -2.91246 virginica        3 5.417906e-66 1.805470e-10         1 2 -10.052563 -2.91246 virginica        3 1.428691e-65 2.418658e-10         1 3  -9.982585 -2.91246 virginica        3 3.767428e-65 3.240102e-10         1 4  -9.912606 -2.91246 virginica        3 9.934630e-65 4.340531e-10         1 5  -9.842628 -2.91246 virginica        3 2.619741e-64 5.814697e-10         1 6  -9.772650 -2.91246 virginica        3 6.908204e-64 7.789531e-10         1  colorfun <- function(n,l=65,c=100) { hues = seq(15, 375, length=n+1); hcl(h=hues, l=l, c=c)[1:n] } # default ggplot2 colours colors <- colorfun(3) colorslight <- colorfun(3,l=90,c=50) ggplot(datPred, aes(x=LD1, y=LD2) ) +     geom_raster(data=df, aes(x=x, y=y, fill = factor(class)),alpha=0.7,show_guide=FALSE) +     geom_contour(data=df, aes(x=x, y=y, z=classnum), colour="red2", alpha=0.5, breaks=c(1.5,2.5)) +     geom_point(data = datPred, size = 3, aes(pch = Species,  colour=Species)) +     scale_x_continuous(limits = ld1lim, expand=c(0,0)) +     scale_y_continuous(limits = ld2lim, expand=c(0,0)) +     scale_fill_manual(values=colorslight,guide=F) 

enter image description here

(well not totally sure this approach for showing classification boundaries using contours/breaks at 1.5 and 2.5 is always correct - it is correct for the boundary between species 1 and 2 and species 2 and 3, but not if the region of species 1 would be next to species 3, as I would get two boundaries there then - maybe I would have to use the approach used here where each boundary between each species pair is considered separately)

This gets me as far as plotting the classification regions. I am looking for a solution though to also plot the actual posterior classification probabilities for each species at each coordinate, using alpha (opaqueness) proportional to the posterior classification probability for each species, and a species-specific colour. In other words, with a stack of three images superimposed. As alpha blending in ggplot2 is known to be order-dependent, I think the colours of this stack would have to calculated beforehand though, and plotted using something like

qplot(x, y, data=mydata, fill=rgb, geom="raster") + scale_fill_identity()  

Here is a SAS example of what I am after:

enter image description here

Would anyone know how to do this perhaps? Or does anyone have any thoughts on how to best represent these posterior classification probabilities?

Note that the method should work for any number of groups, not just for this specific example.

like image 596
Tom Wenseleers Avatar asked Aug 08 '15 13:08

Tom Wenseleers


People also ask

What is posterior probability in LDA?

A posterior probability is the probability of assigning observations to groups given the data. A prior probability is the probability that an observation will fall into a group before you collect the data.

How linear discriminant analysis is used for classification?

Linear discriminant analysis (LDA) is used here to reduce the number of features to a more manageable number before the process of classification. Each of the new dimensions generated is a linear combination of pixel values, which form a template.

Can Discriminant Analysis be used for classification?

Extensions to LDALinear Discriminant Analysis is a simple and effective method for classification. Because it is simple and so well understood, there are many extensions and variations to the method.


Video Answer


1 Answers

I suppose the easiest way will be to show the posterior probabilities. It is pretty straightforward for your case:

datPred$maxProb <- apply(predict(fit)$posterior, 1, max) ggplot(datPred, aes(x=LD1, y=LD2) ) +   geom_raster(data=df, aes(x=x, y=y, fill = factor(class)),alpha=0.7,show_guide=FALSE) +   geom_contour(data=df, aes(x=x, y=y, z=classnum), colour="red2", alpha=0.5, breaks=c(1.5,2.5)) +   geom_point(data = datPred, size = 3, aes(pch = Species,  colour=Species, alpha = maxProb)) +   scale_x_continuous(limits = ld1lim, expand=c(0,0)) +   scale_y_continuous(limits = ld2lim, expand=c(0,0)) +   scale_fill_manual(values=colorslight, guide=F) 

enter image description here

You can see the points blend in at blue-green border.

like image 126
tonytonov Avatar answered Sep 29 '22 00:09

tonytonov