Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: Predict (0,1) in logistic regression in glm()

Tags:

r

statistics

I am trying to model a "what if" situation in a binary logit model. I am estimating the probability of passing a test, given the level of difficulty of the test (1=easiest, 5=toughest), with gender as control. (The data is here). Students are administered a test which is generally tough ("HIGH" in the data). From this we can estimate the impact of test-difficulty on the likelihood of passing:

model = glm(PASS ~ as.factor(SEX) + as.factor(HIGH), family=binomial(link="logit"), data=df)
summary(model)

We can also get the predicted probabilities of passing with:

predict.high = predict(model, type="response")

The question is, what if the "LOW" test were given instead? To get the new probabilities, we can do:

newdata = rename.vars(subset(df, select=c(-HIGH)), 'LOW','HIGH')
predict.low = predict(model, newdata=newdata, type="response")

But how do I know how many additional students would have passed in this case? Is there an obvious switch in glm() I am not seeing?

like image 683
user702432 Avatar asked Aug 30 '25 18:08

user702432


1 Answers

I have not yet tried to dig out my code for prediction that I wrote based on Gelman and Hill (2006) who, I seem to recall used simulation. I still intend to do that. One aspect of your question that seemed unique in my limited experience was that I was accustomed to predicting for a single observation (in this case a single student taking a single test). You, however, seem to want to predict a difference between two sets of predictions. In other words, you want to predict how many more students will pass if given a set of 5 easy exams rather than a set of 5 hard exams.

I am not sure whether Gelman and Hill (2006) covered that. You also seem to want to do this with a frequentist approach.

I am thinking that if you can predict for a single observation, so that you have a confidence interval for each observation, then perhaps you can estimate a weighted average probability of passing within each group and subtract the two weighted averages. The delta method could be used to estimate a confidence interval on the weighted averages and on their difference.

Covariance among predicted observations might have to be assumed to be 0 to implement that approach.

If assuming a covariance of 0 is not satisfactory then perhaps a Bayesian approach would be better. Again, I am only familiar with predicting for a single observation. With a Bayesian approach I have predicted a single observation by including the independent variables, but not the dependent variable, for the observation to be predicted. I suppose you could predict for every observation in the same Bayesian run (predict each student in HIGH and in LOW). The weighted averages of passing tests for each group and the difference in weighted averages are derived parameters and I suspect could be included directly in the code for the Bayesian logistic regression. Then you would have your point estimate and estimate of variance for probability of passing each group of tests and for the difference in probability of passing each group of tests. If you want the difference in the number of students passing each group of tests, perhaps that could be included in the Bayesian code as a derived parameter also.

I realize this answer, so far, has been more conversational than might be desired. I am simply mapping out strategies to attempt without having had the time yet to try implementing those strategies. Providing all of the R and WinBUGS code to implement both proposed strategies might take me a few days. (WinBUGS or OpenBUGS can be called from within R.) I will append the code to this answer as I go along. If anyone deems my proposed strategies, and/or forthcoming code, incorrect I hope they will feel free to point out my errors and offer corrections.

EDIT

Below is code that generates fake data and analyzes that data using a frequentist and Bayesian approach. I have not yet added the code to implement the above ideas for prediction. I will try to add the Bayesian prediction code in the next 1-2 days. I only used three tests instead of five. The way the code is written below you can change the number of students, n, to any non-zero number that can be divided into 6 equal whole numbers.

# Bayesian_logistic_regression_June2012.r
# June 24, 2012

library(R2WinBUGS)
library(arm)
library(BRugs)

set.seed(3234)


# create fake data for n students and three tests

n <- 1200

# create factors for n/6 students in each of 6 categories

gender <- c(rep(0, (n/2)), rep(1, (n/2)))
test2  <- c(rep(0, (n/6)), rep(1, (n/6)), rep(0, (n/6)),
            rep(0, (n/6)), rep(1, (n/6)), rep(0, (n/6)))
test3  <- c(rep(0, (n/6)), rep(0, (n/6)), rep(1, (n/6)),
            rep(0, (n/6)), rep(0, (n/6)), rep(1, (n/6)))

# assign slopes to factors

B0      <-  0.4
Bgender <- -0.2
Btest2  <-  0.6
Btest3  <-  1.2

# estimate probability of passing test

p.pass <- (     exp(B0 + Bgender * gender + 
                         Btest2  * test2  + 
                         Btest3  * test3) /
           (1 + exp(B0 + Bgender * gender +
                         Btest2  * test2  + 
                         Btest3  * test3)))

# identify which students passed their test, 0 = fail, 1 = pass

passed   <- rep(0, n)
r.passed <- runif(n,0,1)
passed[r.passed <= p.pass] = 1

# use frequentist approach in R to estimate probability
# of passing test

m.freq <- glm(passed ~ as.factor(gender) +
                       as.factor(test2)  +
                       as.factor(test3)  , 
                       family = binomial)
summary(m.freq)

# predict(m.freq, type = "response")


# use OpenBUGS to analyze same data set

# Define model

sink("Bayesian.logistic.regression.txt")
cat("
model {

# Priors

 alpha ~ dnorm(0,0.01)
 bgender ~ dnorm(0,0.01)
 btest2 ~ dnorm(0,0.01)
 btest3 ~ dnorm(0,0.01)

# Likelihood

 for (i in 1:n) {
    passed[i] ~ dbin(p[i], 1)
    logit(p[i]) <- (alpha + bgender * gender[i] +
                            btest2  * test2[i]  +
                            btest3  * test3[i])
 }

# Derived parameters

 p.g.t1 <- exp(alpha) / (1 + exp(alpha))
 p.b.t1 <- exp(alpha + bgender) / (1 + exp(alpha + bgender))

 p.g.t2 <- (    exp(alpha +           btest2) / 
           (1 + exp(alpha +           btest2)))
 p.b.t2 <- (    exp(alpha + bgender + btest2) / 
           (1 + exp(alpha + bgender + btest2)))

 p.g.t3 <- (    exp(alpha +           btest3) / 
           (1 + exp(alpha +           btest3)))
 p.b.t3 <- (    exp(alpha + bgender + btest3) / 
           (1 + exp(alpha + bgender + btest3)))

}

", fill = TRUE)
sink()

my.data <- list(passed = passed, 
                gender = gender,
                test2  = test2,
                test3  = test3, 
                n      = length(passed))

# Inits function

inits <- function(){ list(alpha   = rlnorm(1), 
                          bgender = rlnorm(1),
                          btest2  = rlnorm(1),
                          btest3  = rlnorm(1)) }

# Parameters to estimate

params <- c("alpha", "bgender", "btest2", "btest3", 
            "p.g.t1", "p.b.t1", "p.g.t2", "p.b.t2",
            "p.g.t3", "p.b.t3")

# MCMC settings

nc <- 3
ni <- 2000
nb <- 500
nt <- 2

# Start Gibbs sampling

out <- bugs(data = my.data, inits = inits,
parameters.to.save = params, 
"c:/users/Mark W Miller/documents/Bayesian.logistic.regression.txt",
program = 'OpenBUGS', 
n.thin = nt, n.chains = nc, 
n.burnin = nb, n.iter = ni, debug = TRUE)

print(out, dig = 5)

Before I attempt to implement the weighted-average approach to prediction I wanted to convince myself that it might work. So I ginned up the following code, which seems to suggest it may:

# specify number of girls taking each test and
# number of boys taking each test

g.t1 <- rep(0,400)
b.t1 <- rep(0,120)
g.t2 <- rep(0,1200)
b.t2 <- rep(0,50)
g.t3 <- rep(0,1000)
b.t3 <- rep(0,2000)

# specify probability of individuals in each of the
# 6 groups passing their test

p.g1.t1 <- 0.40
p.b1.t1 <- 0.30
p.g1.t2 <- 0.60
p.b1.t2 <- 0.50
p.g1.t3 <- 0.80
p.b1.t3 <- 0.70

# identify which individuals in each group passed their test

g.t1[1:(p.g1.t1 * length(g.t1))] = 1
sum(g.t1)

b.t1[1:(p.b1.t1 * length(b.t1))] = 1
sum(b.t1)

g.t2[1:(p.g1.t2 * length(g.t2))] = 1
sum(g.t2)

b.t2[1:(p.b1.t2 * length(b.t2))] = 1
sum(b.t2)

g.t3[1:(p.g1.t3 * length(g.t3))] = 1
sum(g.t3)

b.t3[1:(p.b1.t3 * length(b.t3))] = 1
sum(b.t3)

# determine the weighted average probability of passing
# on test day for all individuals as a class

wt.ave.p <- ((p.g1.t1 * length(g.t1) + p.b1.t1 * length(b.t1) +
 p.g1.t2 * length(g.t2) + p.b1.t2 * length(b.t2) +
 p.g1.t3 * length(g.t3) + p.b1.t3 * length(b.t3) ) / 

 (length(g.t1) + length(b.t1) + length(g.t2) + 
  length(b.t2) + length(g.t3) + length(b.t3)))

wt.ave.p

# determine the expected number of individuals passing
# their test in the class as a whole

exp.num.pass <- wt.ave.p *  (length(g.t1) + length(b.t1) +
                             length(g.t2) + length(b.t2) +
                             length(g.t3) + length(b.t3))
exp.num.pass

# determine the number of individuals passing

num.passing <- (sum(g.t1) + sum(b.t1) + 
                sum(g.t2) + sum(b.t2) + 
                sum(g.t3) + sum(b.t3) )
num.passing

# the expected number of students passing, exp.num.pass,
# should equal the observed number of students passing,
# num.passing regardless of the number of students in each
# group and regardless of the probability of passing a 
# given test, within rounding error

identical(round(exp.num.pass), round(num.passing)) 

Hopefully in the next couple of days I can try adding the prediction code to the above Bayesian code.

EDIT - June 27, 2012

I have not forgotten about this. Rather, I have encountered several problems:

  1. With logistic regression it is possible to predict: a) the probability, p, that students in a given group pass a test and b) the outcome of a given student taking a test (0 or 1). All of the 0's and 1's are then averaged. I am not sure which of these to use. The point estimate and SD of the predicted p is identical to the estimated p for known test outcomes. The point estimate of the average of the predicted 0's and 1's is a little different and the SD of the averaged 0's and 1's is much larger. I believe I want b, the average of the predicted 0's and 1's. However, I am attempting to examine various websites and books to be sure. Collett (1991) has a worked example that does not employ computer code, but that worked example includes a half-dozen variables including 2 interactions and I am having a little trouble getting my Bayesian estimates to match her frequentist estimates.

  2. With lots of derived parameters the program is taking a long time to run.

  3. Apparently OpenBUGS has been crashing frequently, I believe, even without prediction code. I am not sure whether that is because of something I am doing wrong or because of changes in the recent versions of R or changes in recent versions of R packages or maybe because I am trying to run the code with a 64-bit R or something else.

I will try to post the prediction code soon, but all of the above issues have slowed me down.

like image 70
Mark Miller Avatar answered Sep 02 '25 12:09

Mark Miller