Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Plotting a multiple logistic regression for binary and continuous values in R

I have a data frame of mammal genera. Each row of the column is a different genus. There are three columns: a column of each genus's geographic range size (a continuous variable), a column stating whether or not a genus is found inside or outside of river basins (a binary variable), and a column stating whether the genus is found in the fossil record (a binary variable).

I have performed a multiple logistic regression to see if geographic range size and presence in/out of basins is a predictor of presence in the fossil record using the following R code.

Regression<-glm(df[ ,"FossilRecord"] ~ log(df[ ,"Geographic Range"]) + df[  ,"Basin"], family="binomial")

I am trying to find a way to visually summarize the output of this regression (other than a table of the regression summary).

I know how to do this for a single variable regression. For example, I could use a plot like this one if I wanted to see the relationship between just geographic range size and presence in the fossil record.

However, I do not know how to make a similar or equivalent plot when there are two independent variables, and one of them is binary. What are some plotting and data visualization techniques I could use in this case?

Thanks for the help!

like image 313
Sharon McMullen Avatar asked Apr 29 '16 15:04

Sharon McMullen


3 Answers

Visualization is important and yet it can be very hard. With your example, I would recommend plotting one line for predicted FossilRecord versus GeographicRange for each level of your categorical covariate (Basin). Here's an example of how to do it with the ggplot2 package

##generating data
ssize <- 100
set.seed(12345)
dat <- data.frame(
  Basin =  rbinom(ssize, 1,.4),
  GeographicRange = rnorm(ssize,10,2)
)
dat$FossilRecord = rbinom(ssize,1,(.3 + .1*dat$Basin + 0.04*dat$GeographicRange))

##fitting model
fit <- glm(FossilRecord ~ Basin + GeographicRange, family=binomial(), data=dat)

We can use the predict() function to obtain predicted response values for many GeographicRange values and for each Basin category.

##getting predicted response from model
plotting_dfm <- expand.grid(GeographicRange = seq(from=0, to = 20, by=0.1),
                           Basin = (0:1))
plotting_dfm$preds <- plogis( predict(fit , newdata=plotting_dfm))

Now you can plot the predicted results:

##plotting the predicted response on the two covariates
library(ggplot2)
pl <- ggplot(plotting_dfm, aes(x=GeographicRange, y =preds, color=as.factor(Basin)))
pl + 
  geom_point( ) +
  ggtitle("Predicted FossilRecord by GeoRange and Basin") + 
  ggplot2::ylab("Predicted FossilRecord")

This will produce a figure like this: enter image description here

like image 163
BarkleyBG Avatar answered Nov 17 '22 02:11

BarkleyBG


You can plot a separate curve for each value of the categorical variable. You didn't provide sample data, so here's an example with another data set:

library(ggplot2)

# Data
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")

# Model. gre is continuous. rank has four categories.
m1 = glm(admit ~ gre + rank, family=binomial, data=mydata)

# Predict admit probability
newdata = expand.grid(gre=seq(200,800, length.out=100), rank=1:4)
newdata$prob = predict(m1, newdata, type="response")

ggplot(newdata, aes(gre, prob, color=factor(rank), group=rank)) +
  geom_line()

enter image description here

UPDATE: To respond to @Provisional.Modulation's comment: There are lots of options, depending on what you want to highlight and what is visually clear enough to understand, given your particular data and model output.

Here's an example using the built-in mtcars data frame and a logistic regression with one categorical and two continuous predictor variables:

m1 = glm(vs ~ cyl + mpg + hp, data=mtcars, family=binomial)

Now we create a new data frame with the unique values of cyl, five quantiles of hp and a continuous sequence of mpg, which we'll put on the x-axis (you could also of course do quantiles of mpg and use hp as the x-axis variable). If you have many continuous variables, you may need to set some of them to a single value, say, the median, when you graph the relationships between other variables.

newdata = with(mtcars, expand.grid(cyl=unique(cyl), 
                                   mpg=seq(min(mpg),max(mpg),length=20),
                                   hp = quantile(hp)))

newdata$prob = predict(m1, newdata, type="response") 

Here are three potential graphs, with varying degrees of legibility.

ggplot(newdata, aes(mpg, prob, colour=factor(cyl))) +
  geom_line() +
  facet_grid(. ~ hp)

ggplot(newdata, aes(mpg, prob, colour=factor(hp), linetype=factor(cyl))) +
  geom_line()

ggplot(newdata, aes(mpg, prob, colour=factor(hp))) +
  geom_line() +
  facet_grid(. ~ cyl)

And here's another approach using geom_tile to include two continuous dimensions in each plot panel.

newdata = with(mtcars, expand.grid(cyl=unique(cyl), 
                                   mpg=seq(min(mpg),max(mpg),length=100),
                                   hp =seq(min(hp),max(hp),length=100)))

newdata$prob = predict(m1, newdata, type="response")     

ggplot(newdata, aes(mpg, hp, fill=prob)) +
  geom_tile() +
  facet_grid(. ~ cyl) +
  scale_fill_gradient2(low="red",mid="yellow",high="blue",midpoint=0.5,
                       limits=c(0,1))
like image 4
eipi10 Avatar answered Nov 17 '22 00:11

eipi10


If you're looking for a canned solution, the visreg package might work for you.

An example using @eipi10 's data

library(visreg)
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
m1 = glm(admit ~ gre + rank, family=binomial, data=mydata)
visreg(m1, "admit", by = "rank")

enter image description here

Many more options described in documentation.

like image 1
JohnSG Avatar answered Nov 17 '22 01:11

JohnSG