I have a scatter plot,I want to know how can I find the genes above and below the confidence interval lines?
EDIT: Reproducible example:
library(ggplot2)
#dummy data
df <- mtcars[,c("mpg","cyl")]
#plot
ggplot(df,aes(mpg,cyl)) +
geom_point() +
geom_smooth()
Example 2: Modify Level of Confidence Interval By default, geom_smooth() uses 95% confidence bands but you can use the level argument to specify a different confidence level.
The ggplot2 library is used to plot different graphs. We can use the geom_smooth() or geom_ribbon() method to add confidence interval lines or bands to the plot.
It then computes the confidence interval based on a set number of standard-error intervals around the predicted value (for the typical 95% CI, this is predicted ± 1.96 * se ). The Details section of geom_smooth says: Calculation is performed by the (currently undocumented) 'predictdf()' generic and its methods.
I had to take a deep dive into the github
repo but I finally got it. In order to do this you need to know how stat_smooth
works. In this specific case the loess
function is called to do the smoothing (the different smoothing functions can be constructed using the same process as below):
So, using loess
on this occasion we would do:
#data
df <- mtcars[,c("mpg","cyl"), with=FALSE]
#run loess model
cars.lo <- loess(cyl ~ mpg, df)
Then I had to read this in order to see how the predictions are made internally in stat_smooth
. Apparently hadley uses the predictdf
function (which is not exported to the namespace) as follows for our case:
predictdf.loess <- function(model, xseq, se, level) {
pred <- stats::predict(model, newdata = data.frame(x = xseq), se = se)
if (se) {
y = pred$fit
ci <- pred$se.fit * stats::qt(level / 2 + .5, pred$df)
ymin = y - ci
ymax = y + ci
data.frame(x = xseq, y, ymin, ymax, se = pred$se.fit)
} else {
data.frame(x = xseq, y = as.vector(pred))
}
}
After reading the above I was able to create my own data.frame of the predictions using:
#get the predictions i.e. the fit and se.fit vectors
pred <- predict(cars.lo, se=TRUE)
#create a data.frame from those
df2 <- data.frame(mpg=df$mpg, fit=pred$fit, se.fit=pred$se.fit * qt(0.95 / 2 + .5, pred$df))
Looking at predictdf.loess
we can see that the upper boundary of the confidence interval is created as pred$fit + pred$se.fit * qt(0.95 / 2 + .5, pred$df)
and the lower boundary as pred$fit - pred$se.fit * qt(0.95 / 2 + .5, pred$df)
.
Using those we can create a flag for the points over or below those boundaries:
#make the flag
outerpoints <- +(df$cyl > df2$fit + df2$se.fit | df$cyl < df2$fit - df2$se.fit)
#add flag to original data frame
df$outer <- outerpoints
The df$outer
column is probably what the OP is looking for (it takes the value of 1 if it is outside the boundaries or 0 otherwise) but just for the sake of it I am plotting it below.
Notice the +
function above is only used here to convert the logical flag into a numeric.
Now if we plot as this:
ggplot(df,aes(mpg,cyl)) +
geom_point(aes(colour=factor(outer))) +
geom_smooth()
We can actually see the points inside and outside the confidence interval.
Output:
P.S. For anyone who is interested in the upper and lower boundaries, they are created like this (speculation: although the shaded areas are probably created with geom_ribbon - or something similar - which makes them more round and pretty):
#upper boundary
ggplot(df,aes(mpg,cyl)) +
geom_point(aes(colour=factor(outer))) +
geom_smooth() +
geom_line(data=df2, aes(mpg , fit + se.fit , group=1), colour='red')
#lower boundary
ggplot(df,aes(mpg,cyl)) +
geom_point(aes(colour=factor(outer))) +
geom_smooth() +
geom_line(data=df2, aes(mpg , fit - se.fit , group=1), colour='red')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With