Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

No zeros predicted from zeroinfl object in R?

I created a zero inflated negative binomial model and want to investigate how many of the zeros were partitioned out to sampling or structural zeros. How do I implement this in R. The example code on the zeroinfl page is not clear to me.

data("bioChemists", package = "pscl")

fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")

table(round(predict(fm_zinb2, type="zero"))) 
>   0   1 
> 891  24 

table(round(bioChemists$art))
    >   0   1   2   3   4   5   6   7   8   9  10  11  12  16  19 
    > 275 246 178  84  67  27  17  12   1   2   1   1   2   1   1 

What is this telling me?

When I do the same for my data I get a read out that just has the sample size listed under the 1? Thanks

like image 555
marcellt Avatar asked Dec 19 '22 17:12

marcellt


1 Answers

The details are in the paper by Zeileis (2008) available at https://www.jstatsoft.org/article/view/v027i08/v27i08.pdf

It's a little bit of work (a couple of years and your question was still unanswered) to gather together all the explanations of what the predict function does for each model in the pscl library, and it's buried (pp 19,23) in the mathematical expression of the likelihood function (equations 7, 8). I have interpreted your question to mean that you want/need to know how to use different types of predict:

  • What is the expected count? (type="response")
  • What is the (conditional) expected probability of an excess zero? (type="zero")
  • What is the (marginal) expected probability of any count? (type="prob")
  • And finally how many predicted zeros are excess (eg sampling) rather than regression based (ie structural)?

To read in the data that comes with the pscl package:

data("bioChemists", package = "pscl")

Then fit a zero-inflated negative binomial model:

fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")

If you wish to predict the expected values, then you use

predict(fm_zinb2, type="response")[29:31]
       29        30        31 
0.5213736 1.7774268 0.5136430

So under this model, the expected number of articles published in the last 3 years of a PhD is one half for biochemists 29 and 31 and nearly 2 for biochemist 30.

But I believe that you are after the probability of an excess zero (in the point mass at zero). This command does that and prints off the values for items in row 29 to 31 (yes I went fishing!):

predict(fm_zinb2, type="zero")[29:31]

It produces this output:

        29         30         31 
0.58120120 0.01182628 0.58761308 

So the probability that the 29th item is an excess zero (which you refer to as a sampling zero, i.e. a non-structural zero and hence not explained by the covariates) is 58%, for the 30th is 1.1%, and for the 31st is 59%. So that's two biochemists who are predicted to have zero publications, and this is in excess of those that can be explained by the negative binomial regression on the various covariates.

And you have tabulated these predicted probabilities across the whole dataset

table(round(predict(fm_zinb2, type="zero"))) 
  0   1 
891  24

So your output tells you that only 24 biochemists were likely to be an excess zero, ie with a predicted probability of an excess zero that was over one-half (due to rounding).

It would perhaps be easier to interpret if you tabulated into bins of 10 points on the percentage scale

table(cut(predict(fm_zinb2, type="zero"), breaks=seq(from=0,to=1,by=0.1))) 

to give

 (0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] 
     751        73        34        23        10        22 
(0.6,0.7] (0.7,0.8] (0.8,0.9]   (0.9,1] 
        2         0         0         0

So you can see that 751 biochemists were unlikely to be an excess zero, but 22 biochemists have a chance of between 50-60% of being an excess zero, and only 2 have a higher chance (60-70%). No one was extremely likely to be an excess zero. Graphically, this can be shown in a histogram

hist(predict(fm_zinb2, type="zero"), col="slateblue", breaks=seq(0,0.7,by=.02))

You tabulated the actual number of counts per biochemist (no rounding necessary, as these are counts):

table(bioChemists$art)
  0   1   2   3   4   5   6   7   8   9  10  11  12  16  19 
275 246 178  84  67  27  17  12   1   2   1   1   2   1   1

Who is the special biochemist with 19 publications?

most_pubs <- max(bioChemists$art)
most_pubs
extreme_biochemist <- bioChemists$art==most_pubs
which(extreme_biochemist)

You can obtain the estimated probability that each biochemist has any number of pubs, exactly 0 and up to the maximum, here an unbelievable 19!

preds <- predict(fm_zinb2, type="prob")
preds[extreme_biochemist,]

and you can look at this for our one special biochemist, who had 19 publications (using base R plotting here, but ggplot is more beautiful)

expected <- predict(fm_zinb2, type="response")[extreme_biochemist]
# barplot returns the midpoints for counts 0 up to 19
midpoints<-barplot(preds[extreme_biochemist,], 
  xlab="Predicted #pubs", ylab="Relative chance among biochemists")
# add 1 because the first count is 0
abline(v=midpoints[19+1],col="red",lwd=3)
abline(v=midpoints[round(expected)+1],col="yellow",lwd=3)

and this shows that although we expect 4.73 publications for biochemist 915, under this model more likelihood is given to 2-3 pubs, nowhere near the actual 19 pubs (red line).

Chance of #pubs for biochemist profile 29

Getting back to the question, for biochemist 29, the probability of an excess zero is

pzero <- predict(fm_zinb2, type="zero")
pzero[29]
       29 
0.5812012 

The probability of a zero, overall (marginally) is

preds[29,1]
[1] 0.7320871

So the proportion of predicted probability of a zero that is excess versus structural (ie explained by the regression) is:

pzero[29]/preds[29,1]
       29 
0.7938962

Or the additional probability of a zero, beyond the chance of an excess zero is:

preds[29,1] - pzero[29]

       29 
0.1508859

The actual number of publications for biochemist 29 is

bioChemists$art[29]
[1] 0

So the reason that biochemist is predicted to have zero publications is very little explained by the regression (20%) and mostly not (ie excess, 80%).

And overall, we see that for most biochemists, this is not the case. Our biochemist 29 is unusual, since their chance of zero pubs is mostly excess, ie inexplicable by the regression. We can see this via:

hist(pzero/preds[,1], col="blue", xlab="Proportion of predicted probability of zero that is excess")

which gives you:

Proportion of predicted probability of zero that is excess, across biochemists

like image 98
slouchy Avatar answered Jan 10 '23 18:01

slouchy