
Find points over and under the confidence interval when using geom_stat / geom_smooth in ggplot2

Tags:

r

ggplot2

statistics

bioinformatics

I have a scatter plot and I want to know how I can find the genes that lie above and below the confidence interval lines.


EDIT: Reproducible example:

library(ggplot2)
#dummy data
df <- mtcars[,c("mpg","cyl")]

#plot
ggplot(df,aes(mpg,cyl)) +
  geom_point() +
  geom_smooth()



asked Oct 12 '15 13:10


1 Answer

I had to take a deep dive into the ggplot2 GitHub repo, but I finally got it. In order to do this you need to know how stat_smooth works. In this specific case the loess function is called to do the smoothing (the results for the other smoothing methods can be reproduced with the same process as below):

So, using loess on this occasion we would do:

#data
df <- mtcars[, c("mpg", "cyl")]
#run loess model
cars.lo <- loess(cyl ~ mpg, df)

Then I had to read the ggplot2 source in order to see how the predictions are made internally in stat_smooth. Apparently Hadley uses the predictdf function (which is not exported from the package namespace) as follows for our case:

predictdf.loess <- function(model, xseq, se, level) {
  pred <- stats::predict(model, newdata = data.frame(x = xseq), se = se)

  if (se) {
    y = pred$fit
    ci <- pred$se.fit * stats::qt(level / 2 + .5, pred$df)
    ymin = y - ci
    ymax = y + ci
    data.frame(x = xseq, y, ymin, ymax, se = pred$se.fit)
  } else {
    data.frame(x = xseq, y = as.vector(pred))
  }
}
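To make the function above more concrete, here is a minimal sketch that repeats its steps by hand for our model. It assumes an 80-point grid (matching the default n = 80 of stat_smooth) and level = 0.95; the names xseq, ci and band are only illustrative. Because our model was fitted as cyl ~ mpg, the newdata column is called mpg rather than x.

```r
# Sketch: redo the steps of predictdf.loess by hand on a grid of x values.
df <- mtcars[, c("mpg", "cyl")]
cars.lo <- loess(cyl ~ mpg, df)

level <- 0.95                                             # assumed confidence level
xseq <- seq(min(df$mpg), max(df$mpg), length.out = 80)    # assumed grid size

# predict.loess returns fit, se.fit, residual.scale and df when se = TRUE
pred <- predict(cars.lo, newdata = data.frame(mpg = xseq), se = TRUE)

# half-width of the interval: standard error times the t quantile
ci <- pred$se.fit * qt(level / 2 + 0.5, pred$df)

band <- data.frame(x = xseq,
                   y = pred$fit,
                   ymin = pred$fit - ci,
                   ymax = pred$fit + ci,
                   se = pred$se.fit)
head(band)
```

This gives the band on a regular grid, which is what gets drawn. To flag individual points you need the same quantities at the observed mpg values instead.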

After reading the above I was able to create my own data.frame of the predictions using:

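#get the predictions i.e. the fit and se.fit vectors
pred <- predict(cars.lo, se=TRUE)
#create a data.frame from those
#note: the se.fit column below holds the CI half-width
#(standard error times the t quantile), not the raw standard error
df2 <- data.frame(mpg=df$mpg, fit=pred$fit, se.fit=pred$se.fit * qt(0.95 / 2 + .5, pred$df))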

Looking at predictdf.loess we can see that the upper boundary of the confidence interval is created as pred$fit + pred$se.fit * qt(0.95 / 2 + .5, pred$df) and the lower boundary as pred$fit - pred$se.fit * qt(0.95 / 2 + .5, pred$df).

Using those we can create a flag for the points over or below those boundaries:

#make the flag
outerpoints <- +(df$cyl > df2$fit + df2$se.fit |  df$cyl < df2$fit - df2$se.fit)
#add flag to original data frame
df$outer <- outerpoints

The df$outer column is probably what the OP is looking for (it is 1 if the point is outside the boundaries and 0 otherwise), but just for the sake of it I am plotting it below.

Notice the + function above is only used here to convert the logical flag into a numeric.
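Since the question asks for the points above and below the interval separately, the same comparison can be split into three groups. This is only a sketch under the same assumptions as above (loess fit, level = 0.95); the position column and the above_names / below_names vectors are hypothetical names, not something ggplot2 produces.

```r
# Sketch: label each point as above, below or inside the confidence band.
df <- mtcars[, c("mpg", "cyl")]
cars.lo <- loess(cyl ~ mpg, df)
pred <- predict(cars.lo, se = TRUE)

# half-width of the 95% interval at every observed mpg value
half_width <- pred$se.fit * qt(0.95 / 2 + 0.5, pred$df)
upper <- pred$fit + half_width
lower <- pred$fit - half_width

# 'position' is an illustrative column name
df$position <- ifelse(df$cyl > upper, "above",
                      ifelse(df$cyl < lower, "below", "inside"))

# counts per group, and the row names (here: car models) in each group
table(df$position)
above_names <- rownames(df)[df$position == "above"]
below_names <- rownames(df)[df$position == "below"]
```

With gene data the row names (or an id column) would identify the genes in each group.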

Now if we plot it like this:

ggplot(df,aes(mpg,cyl)) +
  geom_point(aes(colour=factor(outer))) +
  geom_smooth()

We can actually see the points inside and outside the confidence interval.

Output: [image: the scatter plot with points coloured by the outer flag, plus the smooth and its confidence band]

P.S. For anyone who is interested in the upper and lower boundaries, they can be drawn like this (speculation: the shaded area itself is probably drawn with geom_ribbon, or something similar, which makes it look rounder and prettier):

#upper boundary
ggplot(df,aes(mpg,cyl)) +
   geom_point(aes(colour=factor(outer))) +
   geom_smooth() +
   geom_line(data=df2, aes(mpg , fit + se.fit , group=1), colour='red')

#lower boundary
ggplot(df,aes(mpg,cyl)) +
   geom_point(aes(colour=factor(outer))) +
   geom_smooth() +
   geom_line(data=df2, aes(mpg , fit - se.fit , group=1), colour='red')
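As mentioned at the top, other smoothing methods should follow the same process. As a minimal sketch, assuming the plot used geom_smooth(method = "lm") instead of the default loess, base R's predict.lm already returns the interval bounds, so no manual t quantile is needed. The names cars.lm, ci and outer_lm are illustrative only.

```r
# Sketch: the same flag for a linear fit, assuming geom_smooth(method = "lm").
df <- mtcars[, c("mpg", "cyl")]
cars.lm <- lm(cyl ~ mpg, data = df)

# matrix with columns fit, lwr and upr at the observed mpg values
ci <- predict(cars.lm, interval = "confidence", level = 0.95)

# 1 if the point is outside the confidence band, 0 otherwise
df$outer_lm <- +(df$cyl > ci[, "upr"] | df$cyl < ci[, "lwr"])
table(df$outer_lm)
```

I have not checked this against the ggplot2 internals for lm the way I did for loess, so treat it as an approximation of what the plot shows.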


answered Sep 30 '22 17:09

LyzandeR
