Suppose you are modelling binomial data where each response is a number of successes (y) out of a number of trials (N), with some explanatory variables (a and b). There are a few functions that do this kind of thing, and they all seem to use different methods to specify y and N.
In glm, you do glm(cbind(y, N - y) ~ a + b, family = binomial, data = d)
(a two-column matrix of successes and failures on the LHS)
In inla, you do inla(y ~ a + b, family = "binomial", Ntrials = d$N, data = d)
(specify the number of trials separately)
In glmmBUGS, you do glmmBUGS(y + N ~ a + b, family = "binomial", data = d)
(specify successes + trials as terms on the LHS)
When programming new methods, I've always thought it best to follow what glm does, since that's where people would normally first encounter binomial response data. However, I can never remember whether it's cbind(y, N - y) or cbind(y, N), and I usually seem to have successes/number of trials in my data rather than successes/number of failures - YMMV.
Other approaches are possible, of course. For example, using a function on the RHS to mark whether the variable is the number of trials or the number of failures:
myblm( y ~ a + b + Ntrials(N), data=d)
myblm( y ~ a + b + Nfails(M), data=d) # if your dataset has success/failure variables
or defining an operator that just does the cbind, so you can write:
myblm( y %of% N ~ a + b, data=d)
thus attaching some meaning to the LHS and making it explicit (a quick sketch of this is below).
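For what it's worth, the operator version can be prototyped in a couple of lines. A minimal sketch, assuming %of% should simply build the success/failure matrix that glm already accepts:

"%of%" <- function(y, N) cbind(success = y, fail = N - y)

# this already works with plain glm, because the LHS of a
# formula is evaluated as an ordinary expression:
glm(y %of% N ~ a + b, family = binomial, data = d)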
Has anyone got any better ideas? What's the right way to do this?
You can also let y be a fraction, in which case you need to supply the weights argument. It is not in the formula, but it takes almost the same number of keystrokes as if it were. Here is an example:
> set.seed(73574836)
> x <- runif(10)
> n <- sample.int(10, 2) # two trial counts (2 and 8 here), recycled over the 10 rows
> y <- sapply(mapply(rbinom, size = 1, n, (1 + exp(1 - x))^-1), function(x)
+     sum(x == 1)) # number of successes out of n Bernoulli draws
> df <- data.frame(y = y, frac = y / n, x = x, weights = n)
> df
   y  frac      x weights
1  2 1.000 0.9051       2
2  5 0.625 0.3999       8
3  1 0.500 0.4649       2
4  4 0.500 0.5558       8
5  0 0.000 0.8932       2
6  3 0.375 0.1825       8
7  1 0.500 0.1879       2
8  4 0.500 0.5041       8
9  0 0.000 0.5070       2
10 3 0.375 0.3379       8
>
> # the following two fits are identical
> summary(glm(cbind(y, weights - y) ~ x, binomial(), df))
Call:
glm(formula = cbind(y, weights - y) ~ x, family = binomial(),
    data = df)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.731  -0.374   0.114   0.204   1.596

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.416      0.722   -0.58     0.56
x              0.588      1.522    0.39     0.70

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9.5135  on 9  degrees of freedom
Residual deviance: 9.3639  on 8  degrees of freedom
AIC: 28.93

Number of Fisher Scoring iterations: 3
> summary(glm(frac ~ x, binomial(), df, weights = weights))
Call:
glm(formula = frac ~ x, family = binomial(), data = df, weights = weights)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.731  -0.374   0.114   0.204   1.596

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.416      0.722   -0.58     0.56
x              0.588      1.522    0.39     0.70

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9.5135  on 9  degrees of freedom
Residual deviance: 9.3639  on 8  degrees of freedom
AIC: 28.93

Number of Fisher Scoring iterations: 3
The reason the above works comes down to what glm actually does for binomial outcomes: it computes a fraction for each observation and a weight associated with it, regardless of how you specify the outcome. Here is a snippet from ?glm which hints at what is going on in the estimation:
    If a binomial glm model was specified by giving a two-column
    response, the weights returned by prior.weights are the total
    numbers of cases (factored by the supplied case weights) and the
    component y of the result is the proportion of successes.
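You can check this directly on the first fit above; the two components are exactly the weights and frac columns of df:

> fit <- glm(cbind(y, weights - y) ~ x, binomial(), df)
> fit$prior.weights # total numbers of cases, i.e. df$weights
> fit$y             # proportion of successes, i.e. df$frac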
Alternatively, you can make a wrapper for glm.fit or glm using model.frame. See the ... argument in ?model.frame:

    ...: for model.frame methods, a mix of further arguments such as
         data, na.action, subset to pass to the default method. Any
         additional arguments (such as offset and weights or other
         named arguments) which reach the default method are used to
         create further columns in the model frame, with parenthesised
         names such as "(offset)".
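Here is a minimal sketch of such a wrapper (myblm is hypothetical; the match.call manipulation mirrors what lm does with its weights argument, so the Ntrials expression is evaluated inside data and stored as an "(Ntrials)" column of the model frame):

myblm <- function(formula, data, Ntrials) {
  # forward formula, data and Ntrials to model.frame, lm-style
  mf <- match.call(expand.dots = FALSE)
  m <- match(c("formula", "data", "Ntrials"), names(mf), 0L)
  mf <- mf[c(1L, m)]
  mf[[1L]] <- quote(stats::model.frame)
  mf <- eval(mf, parent.frame())

  y <- model.response(mf)
  N <- model.extract(mf, "Ntrials") # the "(Ntrials)" column
  X <- model.matrix(attr(mf, "terms"), mf)

  # binomial()$initialize converts the two-column response into the
  # fraction/weight representation described in the ?glm quote above
  glm.fit(X, cbind(y, N - y), family = binomial())
}

With this, myblm(y ~ x, data = df, Ntrials = weights) reproduces the coefficients of the two fits above.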
I saw Ben Bolker's comment afterwards. The above is what he points out.