Simulate data for logistic regression with fixed r2

I would like to simulate data for a logistic regression where I can specify the explained variance beforehand. Have a look at the code below: I simulate four correlated independent variables and set each logit coefficient to log(2) ≈ 0.69. This works nicely; the explained variance (I report Cox & Snell's r2) comes out at 0.34.

However, I need to specify the regression coefficients so that a pre-specified r2 results from the regression. Say I want to produce an r2 of exactly 0.1: how do the coefficients need to be specified? I am struggling with this.

# Create independent variables
sigma.1 <- matrix(c(1,0.25,0.25,0.25,   
                0.25,1,0.25,0.25,   
                0.25,0.25,1,0.25,    
                0.25,0.25,0.25,1),nrow=4,ncol=4)
mu.1 <- rep(0,4) 
n.obs <- 500000 

library(MASS)
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical=FALSE))

# Create latent continuous response variable 
sample1$ystar <- 0 + log(2)*sample1$V1 + log(2)*sample1$V2 + log(2)*sample1$V3 + log(2)*sample1$V4

# Construct binary response variable
sample1$prob <- exp(sample1$ystar) / (1 + exp(sample1$ystar))
sample1$y <- rbinom(n.obs,size=1,prob=sample1$prob)

# Logistic regression
logreg <- glm(y ~ V1 + V2 + V3 + V4, data=sample1, family=binomial)
summary(logreg)

The output is:

Call:
glm(formula = y ~ V1 + V2 + V3 + V4, family = binomial, data = sample1)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.7536  -0.7795  -0.0755   0.7813   3.3382  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.002098   0.003544  -0.592    0.554    
V1           0.691034   0.004089 169.014   <2e-16 ***
V2           0.694052   0.004088 169.776   <2e-16 ***
V3           0.693222   0.004079 169.940   <2e-16 ***
V4           0.699091   0.004081 171.310   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 693146  on 499999  degrees of freedom
Residual deviance: 482506  on 499995  degrees of freedom
AIC: 482516

Number of Fisher Scoring iterations: 5

And Cox and Snell's r2 gives:

library(pscl)
pR2(logreg)["r2ML"]

 r2ML 
0.3436523 
Lion Behrens asked Mar 15 '18

2 Answers

If you add a random error term to the ystar variable, making ystar.r, and then work with that, you can tweak the standard deviation until it meets your specifications.

sample1$ystar.r <- sample1$ystar+rnorm(n.obs, 0, 3.8)  # tried a few values
sample1$prob <- exp(sample1$ystar.r) / (1 + exp(sample1$ystar.r))
sample1$y <- rbinom(n.obs,size=1,prob=sample1$prob)
logreg <- glm(y ~ V1 + V2 + V3 + V4, data=sample1, family=binomial)
summary(logreg)  # the estimates "shrink"
pR2(logreg)["r2ML"]
#-------
     r2ML 
0.1014792
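The "tried a few values" step can also be automated by treating the noise SD as the unknown in a root-finding problem. This is a minimal sketch, not part of the original answer: `cox_snell_r2`, `r2_gap`, and `sd.star` are illustrative names, and Cox & Snell's r2 is computed directly from the deviances so pscl is not needed here. A smaller n.obs is used for speed.

```r
library(MASS)

set.seed(1)
n.obs <- 1e5  # smaller than the question's 5e5, for speed; r2 is still stable
sigma.1 <- matrix(0.25, 4, 4); diag(sigma.1) <- 1
sample1 <- as.data.frame(mvrnorm(n.obs, rep(0, 4), sigma.1))
sample1$ystar <- log(2) * (sample1$V1 + sample1$V2 + sample1$V3 + sample1$V4)

# Cox & Snell's r2 from the fitted and null deviances: 1 - exp((D1 - D0)/n)
cox_snell_r2 <- function(fit) {
  1 - exp((fit$deviance - fit$null.deviance) / nobs(fit))
}

# Returns (r2 achieved with this noise SD) minus the target.
r2_gap <- function(sd.noise, target) {
  ystar.r <- sample1$ystar + rnorm(n.obs, 0, sd.noise)
  y <- rbinom(n.obs, size = 1, prob = plogis(ystar.r))
  fit <- glm(y ~ V1 + V2 + V3 + V4, data = sample1, family = binomial)
  cox_snell_r2(fit) - target
}

# Near sd.noise = 0.5 the r2 is about 0.34 (above target); a large sd.noise
# drives it toward 0, so the interval brackets a sign change. The objective is
# stochastic, but with n.obs this large the noise in r2 is small.
sd.star <- uniroot(r2_gap, interval = c(0.5, 20), target = 0.1, tol = 0.01)$root
sd.star  # compare with the 3.8 found by trial and error above
```

Each evaluation of `r2_gap` redraws the noise and the binomial outcomes, so `uniroot()` only converges approximately; the large sample keeps that approximation tight.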
IRTFM answered Oct 18 '22


R-squared (and its variants) is a random variable: it depends on the simulated data. If you simulate data with the exact same parameters multiple times, you will most likely get a different value of R-squared each time. Therefore, you cannot produce a simulation where R-squared is exactly 0.1 just by controlling the parameters.

On the other hand, since it is a random variable, you could in principle simulate your data from a conditional distribution (conditioning on a fixed value of R-squared), but you would need to work out what that distribution looks like (the math might get really ugly here; Cross Validated is more appropriate for that part).
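The randomness is easy to see with a quick Monte Carlo check. A minimal sketch, assuming a single-predictor version of the question's setup with a deliberately small n so the variability is visible:

```r
# Refit the same data-generating process many times and watch r2 vary.
set.seed(42)
n <- 200
r2s <- replicate(200, {
  x <- rnorm(n)
  y <- rbinom(n, size = 1, prob = plogis(log(2) * x))
  fit <- glm(y ~ x, family = binomial)
  1 - exp((fit$deviance - fit$null.deviance) / n)  # Cox & Snell's r2
})
range(r2s)  # identical parameters, yet a visibly different r2 each run
```

With the question's n = 500000 the spread shrinks dramatically, which is why tuning the coefficients (or the noise SD, as in the other answer) can hit a target r2 approximately, but never exactly.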

Freguglia answered Oct 18 '22