 

What alternative ways are there to specify binomial successes/trials in a formula?

Tags:

r

formula

Suppose you are modelling binomial data where each response is a number of successes (y) from a number of trials (N) with some explanatory variables (a and b). There are a few functions that do this kind of thing, and they all seem to use different methods to specify y and N.

In glm, you do glm(cbind(y, N-y) ~ a + b, family = binomial, data = d) (two-column matrix of successes/failures on the LHS)

In inla, you do inla(y~a+b, Ntrials=d$N, data=d) (specify number of trials separately)

In glmmBUGS, you do glmmBUGS(y+N~a+b,data=d) (specify success + trials as terms on LHS)

When programming new methods, I've always thought it best to follow what glm does, since that's where people would normally first encounter binomial response data. However, I can never remember if it's cbind(y, N-y) or cbind(y, N) - and I usually seem to have successes/number of trials in my data rather than successes/number of failures - YMMV.

Other approaches are possible of course. For example using a function on the RHS to mark whether the variable is number of trials or number of fails:

myblm( y ~ a + b + Ntrials(N), data=d)
myblm( y ~ a + b + Nfails(M), data=d)  # if your dataset has succ/fail variables

or defining an operator to just do a cbind, so you can do:

myblm( y %of% N ~ a + b, data=d)

thus attaching some meaning to the LHS and making it explicit.
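
A minimal sketch of that operator idea (the %of% name and the glm-based use are just illustrative, not an existing API):

`%of%` <- function(y, N) cbind(success = y, fail = N - y)

# which then allows, for example:
# glm(y %of% N ~ a + b, family = binomial, data = d)

This keeps glm's own success/failure convention under the hood while letting the formula read as "y out of N".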

Has anyone got any better ideas? What's the right way to do this?

asked Oct 24 '13 by Spacedman



1 Answer

You can also let y be a fraction, in which case you need to supply the weights. The weights do not go in the formula argument, but it takes almost the same number of keystrokes as if they did. Here is an example:

> set.seed(73574836)
> x <- runif(10)          # covariate
> n <- sample.int(10, 2)  # two trial counts, recycled over the 10 observations
> y <- sapply(mapply(rbinom, size = 1, n, (1 + exp(1 - x))^-1), function(x) 
+   sum(x == 1))          # simulate the Bernoulli trials and count the successes
> df <- data.frame(y = y, frac = y / n, x = x, weights = n)
> df
   y  frac      x weights
1  2 1.000 0.9051       2
2  5 0.625 0.3999       8
3  1 0.500 0.4649       2
4  4 0.500 0.5558       8
5  0 0.000 0.8932       2
6  3 0.375 0.1825       8
7  1 0.500 0.1879       2
8  4 0.500 0.5041       8
9  0 0.000 0.5070       2
10 3 0.375 0.3379       8
> 
> # the following two fits are identical
> summary(glm(cbind(y, weights - y) ~ x, binomial(), df))

Call:
glm(formula = cbind(y, weights - y) ~ x, family = binomial(), 
    data = df)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.731  -0.374   0.114   0.204   1.596  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.416      0.722   -0.58     0.56
x              0.588      1.522    0.39     0.70

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9.5135  on 9  degrees of freedom
Residual deviance: 9.3639  on 8  degrees of freedom
AIC: 28.93

Number of Fisher Scoring iterations: 3

> summary(glm(frac ~ x, binomial(), df, weights = weights))

Call:
glm(formula = frac ~ x, family = binomial(), data = df, weights = weights)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.731  -0.374   0.114   0.204   1.596  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.416      0.722   -0.58     0.56
x              0.588      1.522    0.39     0.70

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9.5135  on 9  degrees of freedom
Residual deviance: 9.3639  on 8  degrees of freedom
AIC: 28.93

Number of Fisher Scoring iterations: 3

The reason the above works comes down to what glm actually does for binomial outcomes. It computes a fraction for each observation and a weight associated with the observation, regardless of how you specify the outcome. Here is a snippet from ?glm which gives a hint of what is going on in the estimation:

If a binomial glm model was specified by giving a two-column response, the weights returned by prior.weights are the total numbers of cases (factored by the supplied case weights) and the component y of the result is the proportion of successes.
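
You can check this directly on a fitted object (fit is just an illustrative name; prior.weights and y are documented components of a glm fit):

fit <- glm(cbind(y, weights - y) ~ x, binomial(), df)
fit$prior.weights  # total number of trials per observation (here df$weights)
fit$y              # proportion of successes per observation (here df$frac)

Both ways of specifying the outcome end up in this same internal representation, which is why the two summaries above are identical.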

Alternatively, you can make a wrapper for glm.fit or glm using model.frame. See the ... argument in ?model.frame

... for model.frame methods, a mix of further arguments such as data, na.action, subset to pass to the default method. Any additional arguments (such as offset and weights or other named arguments) which reach the default method are used to create further columns in the model frame, with parenthesised names such as "(offset)".
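
A rough sketch of such a wrapper, using the hypothetical myblm name and Ntrials argument from the question (an illustration of the model.frame mechanism, not an existing function):

myblm <- function(formula, Ntrials, data) {
  # Ntrials reaches model.frame's default method and becomes a
  # "(Ntrials)" column in the model frame
  mf <- model.frame(formula, data = data, Ntrials = Ntrials)
  N  <- mf[["(Ntrials)"]]
  y  <- model.response(mf)
  X  <- model.matrix(attr(mf, "terms"), mf)
  # hand glm.fit the success fraction, with the trial counts as weights
  glm.fit(X, y / N, weights = N, family = binomial())
}

# e.g. coef(myblm(y ~ x, Ntrials = df$weights, data = df))
# should give the same coefficients as the two fits above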

Comment

I saw Ben Bolker's comment afterwards. The above is what he points out.

answered Oct 19 '22 by Benjamin Christoffersen