Suppose you are modelling binomial data where each response is a number of successes (y) out of a number of trials (N), with some explanatory variables (a and b). There are a few functions that do this kind of thing, and they all seem to use different methods to specify y and N.
In glm, you do glm(cbind(y, N - y) ~ a + b, family = binomial, data = d)
(a two-column matrix of successes and failures on the LHS)
In inla, you do inla(y ~ a + b, family = "binomial", Ntrials = d$N, data = d)
(specify the number of trials separately)
In glmmBUGS, you do glmmBUGS(y + N ~ a + b, family = "binomial", data = d)
(specify successes + trials as terms on the LHS)
When programming new methods, I've always thought it best to follow what glm does, since that's where people would normally first encounter binomial response data. However, I can never remember whether it's cbind(y, N - y) or cbind(y, N), and I usually seem to have successes/number of trials in my data rather than successes/number of failures - YMMV.
Other approaches are possible, of course. For example, using a function on the RHS to mark whether the variable is the number of trials or the number of failures:
myblm( y ~ a + b + Ntrials(N), data=d)
myblm( y ~ a + b + Nfails(M), data=d) # if your dataset has success/failure variables
or defining an operator that just does the cbind, so you can write:
myblm( y %of% N ~ a + b, data=d)
thus attaching some meaning to the LHS and making it explicit (a quick sketch of this is below).
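For what it's worth, the operator version can be prototyped in a couple of lines. A minimal sketch, assuming %of% should simply build the success/failure matrix that glm already accepts:

"%of%" <- function(y, N) cbind(success = y, fail = N - y)

# this already works with plain glm, because the LHS of a
# formula is evaluated as an ordinary expression:
glm(y %of% N ~ a + b, family = binomial, data = d)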
Has anyone got any better ideas? What's the right way to do this?
You can also let y be a fraction, in which case you need to supply the weights argument. It is not in the formula, but it takes almost the same number of keystrokes as if it were. Here is an example:
> set.seed(73574836)
> x <- runif(10)
> n <- sample.int(10, 2) # two trial counts (2 and 8 here), recycled over the 10 rows
> y <- sapply(mapply(rbinom, size = 1, n, (1 + exp(1 - x))^-1), function(x)
+     sum(x == 1)) # number of successes out of n Bernoulli draws
> df <- data.frame(y = y, frac = y / n, x = x, weights = n)
> df
   y  frac      x weights
1  2 1.000 0.9051       2
2  5 0.625 0.3999       8
3  1 0.500 0.4649       2
4  4 0.500 0.5558       8
5  0 0.000 0.8932       2
6  3 0.375 0.1825       8
7  1 0.500 0.1879       2
8  4 0.500 0.5041       8
9  0 0.000 0.5070       2
10 3 0.375 0.3379       8
>
> # the following two fits are identical
> summary(glm(cbind(y, weights - y) ~ x, binomial(), df))
Call:
glm(formula = cbind(y, weights - y) ~ x, family = binomial(),
    data = df)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.731  -0.374   0.114   0.204   1.596

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.416      0.722   -0.58     0.56
x              0.588      1.522    0.39     0.70

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9.5135  on 9  degrees of freedom
Residual deviance: 9.3639  on 8  degrees of freedom
AIC: 28.93

Number of Fisher Scoring iterations: 3
> summary(glm(frac ~ x, binomial(), df, weights = weights))
Call:
glm(formula = frac ~ x, family = binomial(), data = df, weights = weights)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.731  -0.374   0.114   0.204   1.596

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.416      0.722   -0.58     0.56
x              0.588      1.522    0.39     0.70

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9.5135  on 9  degrees of freedom
Residual deviance: 9.3639  on 8  degrees of freedom
AIC: 28.93

Number of Fisher Scoring iterations: 3
The reason the above works comes down to what glm actually does for binomial outcomes: it computes a fraction for each observation and a weight associated with it, regardless of how you specify the outcome. Here is a snippet from ?glm which hints at what is going on in the estimation:
    If a binomial glm model was specified by giving a two-column
    response, the weights returned by prior.weights are the total
    numbers of cases (factored by the supplied case weights) and the
    component y of the result is the proportion of successes.
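You can check this directly on the first fit above; the two components are exactly the weights and frac columns of df:

> fit <- glm(cbind(y, weights - y) ~ x, binomial(), df)
> fit$prior.weights # total numbers of cases, i.e. df$weights
> fit$y             # proportion of successes, i.e. df$frac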
Alternatively, you can make a wrapper for glm.fit or glm using model.frame. See the ... argument in ?model.frame:

    ...: for model.frame methods, a mix of further arguments such as
         data, na.action, subset to pass to the default method. Any
         additional arguments (such as offset and weights or other
         named arguments) which reach the default method are used to
         create further columns in the model frame, with parenthesised
         names such as "(offset)".
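Here is a minimal sketch of such a wrapper (myblm is hypothetical; the match.call manipulation mirrors what lm does with its weights argument, so the Ntrials expression is evaluated inside data and stored as an "(Ntrials)" column of the model frame):

myblm <- function(formula, data, Ntrials) {
  # forward formula, data and Ntrials to model.frame, lm-style
  mf <- match.call(expand.dots = FALSE)
  m <- match(c("formula", "data", "Ntrials"), names(mf), 0L)
  mf <- mf[c(1L, m)]
  mf[[1L]] <- quote(stats::model.frame)
  mf <- eval(mf, parent.frame())

  y <- model.response(mf)
  N <- model.extract(mf, "Ntrials") # the "(Ntrials)" column
  X <- model.matrix(attr(mf, "terms"), mf)

  # binomial()$initialize converts the two-column response into the
  # fraction/weight representation described in the ?glm quote above
  glm.fit(X, cbind(y, N - y), family = binomial())
}

With this, myblm(y ~ x, data = df, Ntrials = weights) reproduces the coefficients of the two fits above.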
I saw Ben Bolker's comment afterwards. The above is what he points out.