Draw geom_smooth only for fits that are significant

How can I make ggplot draw geom_smooth(method="lm") only when the fit meets some criterion? For instance, I only want to draw lines whose slope is statistically significant (i.e. the p-value from the lm fit is less than 0.01).

EDIT: Updated to a more complex example involving facets. Instead of generating the data from scratch, I modified the diamonds data set.

library(ggplot2)
library(data.table)

data(diamonds)

set.seed(777)
d <- data.table(diamonds)
# .N is the number of selected rows, so every row gets its own noise draw
d[color %in% c("D","E"), c("x","y") := list(x + runif(.N, -5, 5),
                                            y + runif(.N, -5, 5))]
plt <- ggplot(d) + aes(x=x, y=y, color=color) + 
    geom_point() + facet_grid(clarity ~ cut, scales="free")
plt + geom_smooth(method="lm")

What I would like is a way to plot all lines except those which do not have statistically significant slopes (i.e. D and E).

Alexey Shiklomanov asked Jun 11 '15 22:06


1 Answer

You can calculate the p-values by group and then subset in geom_smooth (per the commenters):

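# Determine the slope p-value of the regression for each group in z
# (row 2, column 4 of the coefficient table is Pr(>|t|) for x).
# as.character() keeps the result named by group even if z is a factor.
p.vals = sapply(as.character(unique(d$z)), function(i) {
  coef(summary(lm(y ~ x, data=d[z==i, ])))[2,4]
})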

plt <- ggplot(d) + aes(x=x, y=y, color=z) + geom_point() 

# Select only values of z for which regression p-value is < 0.05   
plt + geom_smooth(data=d[d$z %in% names(p.vals)[p.vals < 0.05],], 
                         aes(x, y, colour=z), method='lm')
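
For anyone who wants to run the idea above end to end, here is a minimal sketch on made-up data. The `toy` data frame, its two groups, and the 0.01 cutoff are assumptions for illustration only, not the data from the question.

```r
library(ggplot2)

# Hypothetical toy data: group "a" has a real slope, group "b" is noise only
set.seed(1)
toy <- data.frame(
  x = rep(1:50, 2),
  z = rep(c("a", "b"), each = 50)
)
toy$y <- ifelse(toy$z == "a", 2 * toy$x, 0) + rnorm(100, sd = 5)

# Slope p-value per level of z (row 2, column 4 of the coefficient table)
p.vals <- sapply(split(toy, toy$z), function(g) {
  coef(summary(lm(y ~ x, data = g)))[2, 4]
})

# Names of the groups whose slope passes the threshold
sig <- names(p.vals)[p.vals < 0.01]

# Points for every group, fitted lines only for the groups in sig
p <- ggplot(toy, aes(x, y, colour = z)) +
  geom_point() +
  geom_smooth(data = toy[toy$z %in% sig, ], method = "lm", formula = y ~ x)
```

Printing `p` draws the plot. Splitting with `split()` names the p-values by group automatically, which is what the `names(p.vals)` lookup relies on.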

UPDATE: Per your comment, try this, for example:

p1 = ggplot(mtcars, aes(wt, mpg)) +
  geom_point() + facet_grid(am ~ carb)

dat = data.frame(x=1:5, y=26:30, carb=1:5)

p1 + geom_point(data=dat, aes(x,y), colour="red", size=5)

Note that since dat has no am column, ggplot just plots the same values in dat for each value of am. Of course you can add values for am and control what's plotted facet by facet.
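
As a rough illustration of the facet-by-facet control mentioned above, the sketch below gives the overlay data an am column as well. The `dat2` data frame and its values are hypothetical, chosen so that each red point lands in exactly one panel.

```r
library(ggplot2)

p1 <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() + facet_grid(am ~ carb)

# Hypothetical overlay data: it has both carb and am, so each row maps
# to a single panel instead of being repeated across the am rows
dat2 <- data.frame(x = c(2, 3), y = c(30, 15), carb = c(1, 4), am = c(1, 0))

p2 <- p1 + geom_point(data = dat2, aes(x, y), colour = "red", size = 5)
```

The same mechanism is what makes the subsetting approach work with geom_smooth: a layer only draws in the panels for which its data has rows.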

UPDATE 2: I think this will take care of the faceting case. Note, however, that most of the regressions have p-values less than 0.05, probably because when you have lots of data, even tiny coefficients will be statistically significant.

## Create a list holding the p-values for regressions on each 
## combination of color, cut, and clarity
pvals = lapply(levels(d$color), function(i) {
  lapply(levels(d$cut), function(j) {
    lapply(levels(d$clarity), function(k) {
      if(nrow(d[color==i & cut==j & clarity==k, ]) > 1) {
        data.frame(color=i, cut=j, clarity=k, 
                   p.val=coef(summary(lm(y ~ x, data = d[color==i & cut==j & clarity==k, ])))[2,4])
      }
    })
  })
})

# Flatten pvals to a single list level
pvals = unlist(unlist(pvals, recursive=FALSE), recursive=FALSE)

# Turn pvals into a data frame
pvals = do.call(rbind, pvals)

# Keep only rows with p.val < 0.05
pvals = pvals[pvals$p.val < 0.05, ]

plt <- ggplot(d) + aes(x=x, y=y, color=color) + 
  geom_point() + facet_grid(clarity ~ cut, scales="free")

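# Create a subset of data frame d containing only combinations of 
# color, cut, and clarity for which we want to plot regression lines
# (you could subset right in the call to geom_smooth, but I thought this would be more clear).
# Match on the combination itself: testing each column separately with %in%
# would keep any row whose color, cut, and clarity each appear somewhere in pvals.
keep = paste(pvals$color, pvals$cut, pvals$clarity)
d.subset = d[paste(color, cut, clarity) %in% keep, ]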

# Plot regression lines only for groups in d.subset
plt + geom_smooth(data=d.subset, method="lm")
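
If the nested lapply calls feel heavy, the same per-combination filtering can probably be written with data.table grouping. This is only a sketch under a few assumptions: `slope_p` is a hypothetical helper (not part of any package), the 0.05 cutoff is carried over from above, and the data are rebuilt as in the question but with `runif(.N, ...)` so the noise length matches the selected rows.

```r
library(ggplot2)
library(data.table)

# Rebuild the question's data: add noise to x and y for colors D and E
set.seed(777)
d <- data.table(diamonds)
d[color %in% c("D", "E"), c("x", "y") := list(x + runif(.N, -5, 5),
                                              y + runif(.N, -5, 5))]

# slope_p is a hypothetical helper: p-value of the slope in y ~ x,
# or NA when the group is too small or x is constant
slope_p <- function(x, y) {
  if (length(x) < 3 || length(unique(x)) < 2) return(NA_real_)
  coef(summary(lm(y ~ x)))[2, 4]
}

# One fit per (color, cut, clarity) combination; the p-value is
# attached to every row of that combination
d[, p.val := slope_p(x, y), by = .(color, cut, clarity)]

# Keep only rows from combinations with a significant slope
d.sig <- d[!is.na(p.val) & p.val < 0.05]

p <- ggplot(d, aes(x = x, y = y, color = color)) +
  geom_point() +
  facet_grid(clarity ~ cut, scales = "free") +
  geom_smooth(data = d.sig, method = "lm", formula = y ~ x)
```

Because the p-value lives on each row, the filter works on the exact combination rather than on each column separately, and the factor columns keep their original levels for faceting.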
eipi10 answered Oct 13 '22 11:10