Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

simulating an observational dataset for a teaching example

Tags:

r

I'm looking for help simulating an observational dataset in R where the correlated variables Z and D confound the relationship between the exposure (0/1) and the outcome (0-10).

This simstudy example is close, but I'm not certain about the formulas. I like the potential outcomes approach shown, and would like to bake in an effect (exposure → outcome) of 1.5 that I can recover once controlling for Z and D.

enter image description here

like image 679
Eric Green Avatar asked Sep 17 '25 13:09

Eric Green


1 Answers

If you are looking for a single data set with the specified characteristics, you could try:

set.seed(4410)

Z        <- sample(100, 1000, TRUE)
D        <- Z + rnorm(1000, 0, 20)
D        <- round((D - min(D)) / diff(range(D)) * 600 + 200)
Exposure <- rbinom(1000, 1, (D + Z) / 1000)
Outcome  <-  rbinom(1000, 10,  (Exposure + 2.3) / 6.6 + (D + Z)/2000)

df <- data.frame(Exposure, Outcome, Z, D)

Note that this gives all integers, and each of the integers is within the specified range for each variable:

head(df)
#>   Exposure Outcome  Z   D
#> 1        0       4  8 344
#> 2        0       5 15 372
#> 3        1       7 39 549
#> 4        0       6 66 508
#> 5        1       9 78 570
#> 6        1       9 28 347

lapply(df, range)
#> $Exposure
#> [1] 0 1
#> 
#> $Outcome
#> [1]  1 10
#> 
#> $Z
#> [1]   1 100
#> 
#> $D
#> [1] 200 800

We can confirm that Z and D are positively correlated with each other, and that both Z and D are positively correlated with Exposure:

cor(df$Z, df$D)
#> [1] 0.832381
cor(df$Z, df$Exposure)
#> [1] 0.235582
cor(df$D, df$Exposure)
#> [1] 0.2367981

Without taking the confounders into account, the effect of Exposure is about 1.8:

lm(Outcome ~ Exposure, data = df)
#> 
#> Call:
#> lm(formula = Outcome ~ Exposure, data = df)
#> 
#> Coefficients:
#> (Intercept)     Exposure  
#>       6.127        1.795

But taking the confounders into account, we can see that the coefficient for Exposure is 1.5

summary(lm(Outcome ~ Exposure + Z + D, data = df))
#> 
#> Call:
#> lm(formula = Outcome ~ Exposure + Z + D, data = df)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -4.5112 -0.9401  0.0876  0.9671  4.1952 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 3.8034249  0.2496879  15.233  < 2e-16 ***
#> Exposure    1.5000019  0.0900666  16.654  < 2e-16 ***
#> Z           0.0035773  0.0027056   1.322    0.186    
#> D           0.0045452  0.0006929   6.560 8.65e-11 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.364 on 996 degrees of freedom
#> Multiple R-squared:  0.378,  Adjusted R-squared:  0.3761 
#> F-statistic: 201.7 on 3 and 996 DF,  p-value: < 2.2e-16

Created on 2022-12-31 with reprex v2.0.2

like image 156
Allan Cameron Avatar answered Sep 19 '25 04:09

Allan Cameron