Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Interpretation and plotting of logistic regression

I'm currently working on a research involving different species of bats and fragmentation of the habitat. My dataset contains presence data (1 = present, 0 = absent) and data on the fragment sizes, body mass (both continuous) and feeding guilds (Feeding.Guild; categorical, 6 levels: carnivore, frugivore insectivore, nectarivore, omnivore and sanguinivore). The fragment sizes (logFrag) and body masses (logMass) are transformed using the natural log to conform to the normal distribution. I can't present the full data set (bats2) due to being classified.

To analyse this data, I use logistic regression. In R, this is the glm function with the binomial family.

bats2 <- read.csv("Data_StackExchange.csv", 
                 quote = "", sep=";", dec = ".", header=T, row.names=NULL)
bats2$presence <- ifelse(bats2$Corrected.Abundance == 0, 0, 1)
bats2$logFrag <- log(bats2$FragSize)
bats2$logMass <- log(bats2$Mass)
str(bats2$Feeding.Guild)
     Factor w/ 6 levels "carnivore","frugivore",..: 6 1 5 5 2 2 2 2 2 2 ...    
levels(bats2$Feeding.Guild)
    [1] "carnivore"    "frugivore"    "insectivore"  "nectarivore"  "omnivore"     "sanguinivore"


regPresence <- glm(bats2$presence~(logFrag+logMass+Feeding.Guild), 
                   family="binomial", data=bats2)

The results of this regression are obtained by the summary() function and are as follows.

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)               -4.47240    0.64657  -6.917 4.61e-12 ***
logFrag                    0.10448    0.03507   2.979 0.002892 ** 
logMass                    0.39404    0.09620   4.096 4.20e-05 ***
Feeding.Guildfrugivore     3.36245    0.49378   6.810 9.78e-12 ***
Feeding.Guildinsectivore   1.97198    0.51136   3.856 0.000115 ***
Feeding.Guildnectarivore   3.85692    0.55379   6.965 3.29e-12 ***
Feeding.Guildomnivore      1.75081    0.51864   3.376 0.000736 ***
Feeding.Guildsanguinivore  1.73381    0.56881   3.048 0.002303 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

My first question is to verify that I interpret this data correctly: how to interpret this data correctly? I used this website to aid me in interpreting.

Additionally, I tried plotting this data in order to visualise it. However, when adding the facet_wrap function to make seperate plots for the different feeding guilds, the intercepts and slopes change compared to colouring the different feeding guilds in one plot. I used the following code:

Plot 1:

library(ggplot2)
qplot(logFrag, bats2$presence, colour=Feeding.Guild, data=bats2, se=F) +
  geom_smooth(method = glm, family = "binomial", se=F, na.rm=T) + theme_bw() 

Plot 2:

qplot(logFrag, bats2$presence, data=bats2, se=F) + facet_wrap(~Feeding.Guild, 
                                                              scales="free") +
  geom_smooth(method = glm, family = "binomial", se=F, na.rm=T) + theme_bw() 

Resulting in the following images:

Plot 1 (left) and plot 2 (right).

What causes these differences and which one would be the correct one?

Sample data set (part of data set which is not classified).

like image 316
Y. Hidskes Avatar asked Oct 30 '22 01:10

Y. Hidskes


1 Answers

The resource you have linked to has an explanation of the interpretation in the bulleted section under the heading Using the logit model. Estimate is each covariate's additive effect on the log-odds of presence. This is per 1 unit increase of a continuous co-variate or per instance of the categorical. A couple of points on this:

  • Because you have taken the log of the continuous covariates, their effect is per 1 unit on the log scale - quite hard to interpret. I would strongly advise against this. There is no requirement for normality of FragSize or Mass in order to fit this model.
  • Notice that one of your categories is missing from the list? The effect of the covariates has to be measured with respect to some reference. In this case the reference is carnivores with logFrag=0 and logMass=0. These 0 values are impossible. This is usual, and why interpretation of the (Intercept) is not useful for you.

On to Std. Error, this is a measure of your confidence in your Estimate effects. People often use a normal approximation of +- 2*Std. Error around the Estimate to form confidence intervals and make statements using them. When the interval of +- 2*Std. Error contains 0 there is some probability that the true effect is 0. You don't want that, so you're looking for small values of Std. Error with respect to the Estimate

z value and Pr(>|z|) relate to the normal approximation I mentioned. You probably already know what a Z score (Standard Normal) is and how people use them to perform significance tests.

Now to your plots: The plots aren't actually plotting your model. You are using the smoother to fit a new model of a similar type but over a different set of data. The smoother is only considering the effect of logFrag to fit a mini logistic model within each guild.

So we expect the plots to differ from the summary(), but not from eachother. The reason this has happened is interesting and it's to do with using bats2$presence instead of presence. When you pass in bats2$presence, this effectively like passing ggplot2 a separate anonymous list of data. So long as that list aligns with the dataframe as you would expect, all is well. It seems that facet_wrap() mixes up the data when using bats2$presence, probably due to sorting bats2 by guild. Use plain old presence and they'll come out the same.

like image 112
MilesMcBain Avatar answered Nov 14 '22 21:11

MilesMcBain