I have a data frame of mammal genera. Each row of the column is a different genus. There are three columns: a column of each genus's geographic range size (a continuous variable), a column stating whether or not a genus is found inside or outside of river basins (a binary variable), and a column stating whether the genus is found in the fossil record (a binary variable).
I have performed a multiple logistic regression to see if geographic range size and presence in/out of basins is a predictor of presence in the fossil record using the following R code.
Regression<-glm(df[ ,"FossilRecord"] ~ log(df[ ,"Geographic Range"]) + df[ ,"Basin"], family="binomial")
I am trying to find a way to visually summarize the output of this regression (other than a table of the regression summary).
I know how to do this for a single variable regression. For example, I could use a plot like if I wanted to see the relationship between just geographic range size and presence in the fossil record.
However, I do not know how to make a similar or equivalent plot when there are two independent variables, and one of them is binary. What are some plotting and data visualization techniques I could use in this case?
Thanks for the help!
Visualization is important and yet it can be very hard. With your example, I would recommend plotting one line for predicted FossilRecord versus GeographicRange for each level of your categorical covariate (Basin). Here's an example of how to do it with the ggplot2 package
##generating data
ssize <- 100
set.seed(12345)
dat <- data.frame(
Basin = rbinom(ssize, 1,.4),
GeographicRange = rnorm(ssize,10,2)
)
dat$FossilRecord = rbinom(ssize,1,(.3 + .1*dat$Basin + 0.04*dat$GeographicRange))
##fitting model
fit <- glm(FossilRecord ~ Basin + GeographicRange, family=binomial(), data=dat)
We can use the predict()
function to obtain predicted response values for many GeographicRange values and for each Basin category.
##getting predicted response from model
plotting_dfm <- expand.grid(GeographicRange = seq(from=0, to = 20, by=0.1),
Basin = (0:1))
plotting_dfm$preds <- plogis( predict(fit , newdata=plotting_dfm))
Now you can plot the predicted results:
##plotting the predicted response on the two covariates
library(ggplot2)
pl <- ggplot(plotting_dfm, aes(x=GeographicRange, y =preds, color=as.factor(Basin)))
pl +
geom_point( ) +
ggtitle("Predicted FossilRecord by GeoRange and Basin") +
ggplot2::ylab("Predicted FossilRecord")
This will produce a figure like this:
You can plot a separate curve for each value of the categorical variable. You didn't provide sample data, so here's an example with another data set:
library(ggplot2)
# Data
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
# Model. gre is continuous. rank has four categories.
m1 = glm(admit ~ gre + rank, family=binomial, data=mydata)
# Predict admit probability
newdata = expand.grid(gre=seq(200,800, length.out=100), rank=1:4)
newdata$prob = predict(m1, newdata, type="response")
ggplot(newdata, aes(gre, prob, color=factor(rank), group=rank)) +
geom_line()
UPDATE: To respond to @Provisional.Modulation's comment: There are lots of options, depending on what you want to highlight and what is visually clear enough to understand, given your particular data and model output.
Here's an example using the built-in mtcars
data frame and a logistic regression with one categorical and two continuous predictor variables:
m1 = glm(vs ~ cyl + mpg + hp, data=mtcars, family=binomial)
Now we create a new data frame with the unique values of cyl
, five quantiles of hp
and a continuous sequence of mpg
, which we'll put on the x-axis (you could also of course do quantiles of mpg
and use hp
as the x-axis variable). If you have many continuous variables, you may need to set some of them to a single value, say, the median, when you graph the relationships between other variables.
newdata = with(mtcars, expand.grid(cyl=unique(cyl),
mpg=seq(min(mpg),max(mpg),length=20),
hp = quantile(hp)))
newdata$prob = predict(m1, newdata, type="response")
Here are three potential graphs, with varying degrees of legibility.
ggplot(newdata, aes(mpg, prob, colour=factor(cyl))) +
geom_line() +
facet_grid(. ~ hp)
ggplot(newdata, aes(mpg, prob, colour=factor(hp), linetype=factor(cyl))) +
geom_line()
ggplot(newdata, aes(mpg, prob, colour=factor(hp))) +
geom_line() +
facet_grid(. ~ cyl)
And here's another approach using geom_tile
to include two continuous dimensions in each plot panel.
newdata = with(mtcars, expand.grid(cyl=unique(cyl),
mpg=seq(min(mpg),max(mpg),length=100),
hp =seq(min(hp),max(hp),length=100)))
newdata$prob = predict(m1, newdata, type="response")
ggplot(newdata, aes(mpg, hp, fill=prob)) +
geom_tile() +
facet_grid(. ~ cyl) +
scale_fill_gradient2(low="red",mid="yellow",high="blue",midpoint=0.5,
limits=c(0,1))
If you're looking for a canned solution, the visreg package might work for you.
An example using @eipi10 's data
library(visreg)
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
m1 = glm(admit ~ gre + rank, family=binomial, data=mydata)
visreg(m1, "admit", by = "rank")
Many more options described in documentation.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With