<p>I have a dataset with both missing-at-random (MAR) and censored data. The variables are correlated and I am trying to impute the missing data conditionally so that I can estimate the distribution parameters for a correlated multivariate normal distribution. I would like to use Gibbs MCMC, but am having difficulty implementing the procedure. My dataframe has 5 variables (denoted x1:x5), 1099 samples which contain some combination of MAR, censored and observed values. This is what I have tried so far:</p> <pre class="prettyprint"><code># packages library(msm, tmvtnorm, MCMCpack) # priors theta0<-c(rep(0, 5)) Sigma0<-S0<-diag(5) nu0<-4 # initialize parameters theta<-c(rep(0, 5)) Tau<-diag(5) # initialize output matrix n_samples <- 1000 mu_MCMC <- matrix(0, nrow = n_samples, ncol = 5) mu_MCMC[1,] <- theta cov_MCMC <- matrix(0, nrow = n_samples, ncol = 25) cov_MCMC[1,] <- c(diag(5)) # detection limits det_lim <- matrix(c(-1.7, 0, 0, 0, 0), nrow = 1, ncol = 5) # function to detect NaN (i.e., below detection data) is.nan.data.frame <- function(x) do.call(cbind, lapply(x, is.nan)) for(i in 2:n_samples){ imputedDF <- data.frame() for(r in 1:nrow(originalDF)){ # variables that are MAR or censored mis <- r[, is.na(r) & is.nan(r)] # variables that are observed obs <- r[, !is.na(r)] # subset mu for missing, observed mu1 <- mu[, names(r) %in% names(mis)] mu2 <- mu[, names(r) %in% names(obs)] # calculate sigmas for MVN partitions of mis, obs sigma11 <- sigma[names(r) %in% names(mis), names(r) %in% names(mis)] sigma22 <- sigma[names(r) %in% names(obs), names(r) %in% names(obs)] sigma12 <- sigma[names(r) %in% names(obs), names(r) %in% names(mis)] sigma21 <- t(sigma12) # create matrix for detection limits based on missing values ## if NaN, use detection limit; if NA use Inf dl <- c(ifelse("x1" %in% names(is.nan(r)), det_lim[1, "x1"], Inf), ifelse("x2" %in% names(is.nan(r)), det_lim[1, "x2"], Inf), ifelse("x3" %in% names(is.nan(r)), det_lim[1, "x3"], Inf), ifelse("x4" %in% names(is.nan(r)), det_lim[1, "x4"], Inf), ifelse("x5" %in% names(is.nan(r)), det_lim[1, "x5"], Inf)) # compute mu, sigma to use for conditional MVN ## if all values are missing if(length(names(obs) == 0) { mu_mis <- mu1 sigma_mis <- sigma11 ## otherwise } else { mu_mis <- mu1 + sigma12 %*% solve(sigma22) * (obs - t(mu2)) sigma_mis <- sigma11 - sigma12 %*% solve(sigma22) %*% sigma21 } # imputation ## if all data are observed, missing is empty if(length(obs) == 0) { mis_impute <- data.frame() ## only need to impute a single value } else if(length(names(mis)) == 1) { mis_impute <- rtnorm(1, mean = mu_mis, sd = sigma_mis, lower = -Inf, upper = dl) ## have more than one missing value } else { mis_impute <- rtmvnorm(1, mean = mu_mis, sigma = sigma_mis, lower = rep(-Inf, length = length(names(mis))), upper = dl) } # merge observed values with simulated ## if all values observed if(length(names(mis)) == 0) { sim_result <- obs } else { sim_result <- cbind(mis_impute, obs) } imputedDF <- rbind(imputedDF, sim_result) } # update theta v <- solve(solve(Sigma0) + nrow(sim_result)*Tau) m <- v %*% (solve(Sigma0) %*% theta0 + Tau %*% apply(sim_result,2,sum)) mu <- as.data.frame(rmvnorm(1,m,v)) mu_MCMC[i,] <- mu # update Sigma tmp <- t(sim_result) - mu Tau <- rwish(nu0 + nrow(sim_result), solve(S0 + t(tmp) %*% tmp)) sigma <- matrix(c(solve(Tau)), nrow = 5, ncol = 5, byrow = TRUE) cov_MCMC[i,] <- c(solve(Tau)) } </code></pre> <p>I keep running into errors because the imputation returns NaN and NA values, but I can't figure out what is going wrong because when I test it just using the inner loop to impute the data, it seems to work. Thus, the issue seems to be the parameter updating but I can't figure it out! </p>

<h3>Preamble:</h3> <p>My sense is that part of the problem here is we do not have a good example dataset to work off. </p> <p>My feeling is we can address this by creating an example dataset to frame the solution discussion. A useful package to this end is the Wakefield package that allows for the creation of simulated datasets.</p> <p>We might, for example, create a dataset of 2000 people, where some of the ages, gender, employment status, education data, and marital status information is missing. </p> <h3>Imputation</h3> <p>The core question is can we impute the age or gender from other data in the data set? </p> <p>For example, if we do not know someone's age, can we impute it from their marital status, employment type and or their Education level? At a very simplistic level, we might simply search for entries with NA for age, and look at Marital status. If the marital status is "married", then we impute that our data set is for American's and look up at the average age for marriage and replace with an estimated age for a married person. </p> <p>We can expand on this and make our estimates more accurate by taking into account more variables. For example, we might look at both Marital Status, Education level and Employment status to further improve our age estimate. If a person is married, with a Ph.D., and retired we push the age upwards. If a person is single, a student we push the age lower. Further to this, we can look at the distribution of the ages in the data set to impute data about missing values.</p> <h3>Generate an Example Data Set.</h3> <pre class="prettyprint"><code># packages requiredPackages <- c("wakefield", "dplyr", "BaylorEdPsych", "mice", "MCMCpack") ipak <- function(pkg) { new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])] if (length(new.pkg)) { install.packages(new.pkg, dependencies = TRUE) } sapply(pkg, require, character.only = TRUE) } ipak(requiredPackages) # generate some data for Males with a 8% missing value for # age set.seed(10) f.df <- r_data_frame(n = 1000, age, gender(x = c("M", "F"), prob = c(0, 1), name = "Gender"), employment(x = c("Full Time", "Part Time", "Unemployed", "Retired", "Student"), prob = c(0.6, 0.1, 0.1, 0.1, 0.1), name = "Employment"), education(x = c("No Schooling Completed", "Nursery School to 8th Grade", "9th Grade to 12th Grade, No Diploma", "Regular High School Diploma", "GED or Alternative Credential", "Some College, Less than 1 Year", "Some College, 1 or More Years, No Degree", "Associate's Degree", "Bachelor's Degree", "Master's Degree", "Professional School Degree", "Doctorate Degree"), prob = c(0.013, 0.05, 0.085, 0.246, 0.039, 0.064, 0.15, 0.075, 0.176, 0.072, 0.019, 0.012), name = "Education"), marital(x = c("Married", "Divorced", "Widowed", "Separated", "Never Married"), prob = NULL, name = "Marital")) %>% r_na(cols = 1 - 3, prob = 0.05) # str(f.df) summary(f.df) set.seed(20) # generate some data for Females with a 5% missing value for # age m.df <- r_data_frame(n = 1000, age, gender(x = c("M", "F"), prob = c(1, 0), name = "Gender"), employment(x = c("Full Time", "Part Time", "Unemployed", "Retired", "Student"), prob = c(0.6, 0.1, 0.1, 0.1, 0.1), name = "Employment"), education(x = c("No Schooling Completed", "Nursery School to 8th Grade", "9th Grade to 12th Grade, No Diploma", "Regular High School Diploma", "GED or Alternative Credential", "Some College, Less than 1 Year", "Some College, 1 or More Years, No Degree", "Associate's Degree", "Bachelor's Degree", "Master's Degree", "Professional School Degree", "Doctorate Degree"), prob = c(0.013,0.05, 0.085, 0.246, 0.039, 0.064, 0.15, 0.075, 0.176, 0.072,0.019, 0.012), name = "Education"), marital(x = c("Married", "Divorced", "Widowed", "Separated", "Never Married"), prob = NULL, name = "Marital")) %>% r_na(cols = 1 - 3, prob = 0.03) summary(m.df) all.df = rbind.data.frame(m.df, f.df) summary(all.df) </code></pre> <h3>Data Summary</h3> <pre class="prettyprint"><code>> summary(all.df) Age Gender Employment Education Marital Min. :18.00 M:1000 Full Time :1142 Regular High School Diploma :459 Married :394 1st Qu.:35.00 F:1000 Part Time : 207 Bachelor's Degree :356 Divorced :378 Median :54.00 Unemployed: 193 Some College, 1 or More Years, No Degree:284 Widowed :411 Mean :53.76 Retired : 182 9th Grade to 12th Grade, No Diploma :156 Separated :379 3rd Qu.:72.00 Student : 196 Associate's Degree :145 Never Married:358 Max. :89.00 NA's : 80 (Other) :520 NA's : 80 NA's :80 NA's : 80 > </code></pre> <h3>Is data Missing completely at Random or Not Missing at Random?</h3> <pre class="prettyprint"><code># Test for MCAR - Missing at Completely at Random... test_mcar <- LittleMCAR(all.df) print(test_mcar$amount.missing) print(test_mcar$p.value) </code></pre> <h3>Console Output</h3> <pre class="prettyprint"><code>> # Test for MCAR - Missing at Completely at Random... > test_mcar <- LittleMCAR(all.df) this could take a while> print(test_mcar$amount.missing) Age Gender Employment Education Marital Number Missing 80.00 0 80.00 80.00 80.00 Percent Missing 0.04 0 0.04 0.04 0.04 > print(test_mcar$p.value) [1] 0.02661428 </code></pre> <h3>Imputation of Data</h3> <p>Ok, let us first look at the distribution of missing values. We can run mice::md.pattern() function, to show the distribution of the missing values over the other columns in the dataframe. The <code>md.pattern()</code> function output is useful for suggesting which variables might be good candidates to use for imputing the missing values:</p> <pre class="prettyprint"><code>> md.pattern(all.df) Gender Age Employment Education Marital 1696 1 1 1 1 1 0 73 1 1 1 1 0 1 73 1 1 1 0 1 1 2 1 1 1 0 0 2 71 1 1 0 1 1 1 3 1 1 0 1 0 2 2 1 1 0 0 1 2 71 1 0 1 1 1 1 2 1 0 1 1 0 2 3 1 0 1 0 1 2 4 1 0 0 1 1 2 0 80 80 80 80 320 </code></pre> <p>Ok, from this we can now move to impute the missing values:</p> <pre class="prettyprint"><code>imp <- mice(all.df, m = 5, maxit = 50, seed = 1234, printFlag = FALSE) </code></pre> <ul> <li>The m=5 parameter specifies that you end up with five plausible imputations for the variable</li> <li>The maxit=50 parameter specifies that there will be up to 50 iterations of the algorithm before it converges to a solution and this can be adjusted upward or downward to the desired precision</li> </ul> <p>The mice() function may take a while depending upon the number of iterations we specify. In this case, upon completion we can see some of the imputed values for Age using head() function:</p> <pre class="prettyprint"><code>head(imp$imp$Age) 1 2 3 4 5 7 28 49 37 70 89 33 55 54 52 88 24 56 89 83 68 71 61 84 43 43 24 30 31 96 28 64 89 41 50 120 47 34 36 22 77 </code></pre> <p>To actually complete the imputation, we have to run the <code>complete()</code> function and assign the results to a new dataframe. This version of <code>complete()</code> function will collect all imputations in the assigned dataframe via the "long" parameter:</p> <pre class="prettyprint"><code>all_imputed_df <- complete(imp, "long", include = TRUE) table(all_imputed_df$.imp, is.na(all_imputed_df$Age)) </code></pre> <h3>Console:</h3> <pre class="prettyprint"><code>> all_imputed_df <- complete(imp, "long", include = TRUE) > table(all_imputed_df$.imp, is.na(all_imputed_df$Age)) FALSE TRUE 0 1920 80 1 2000 0 2 2000 0 3 2000 0 4 2000 0 5 2000 0 </code></pre> <p>Now we have a dataset of 12000 age values, across 5 age inputted values.</p> <p>Let's try a regression with imputation #3.</p> <p>First, extract impute #3</p> <pre class="prettyprint"><code>impute.3 <- subset(all_imputed_df,.imp=='3') summary(impute.3) </code></pre> <h3>Console:</h3> <pre class="prettyprint"><code>> impute.3 <- subset(all_imputed_df, .imp == "3") > summary(impute.3) .imp .id Age Gender Employment Min. :3 Min. : 1.0 Min. :18.00 M:1000 Full Time :1192 1st Qu.:3 1st Qu.: 500.8 1st Qu.:35.00 F:1000 Part Time : 211 Median :3 Median :1000.5 Median :54.00 Unemployed: 202 Mean :3 Mean :1000.5 Mean :53.89 Retired : 191 3rd Qu.:3 3rd Qu.:1500.2 3rd Qu.:72.00 Student : 204 Max. :3 Max. :2000.0 Max. :89.00 Education Marital Regular High School Diploma :478 Married :416 Bachelor's Degree :376 Divorced :390 Some College, 1 or More Years, No Degree:295 Widowed :425 9th Grade to 12th Grade, No Diploma :168 Separated :393 Associate's Degree :150 Never Married:376 Master's Degree :141 (Other) :392 </code></pre> <p>Now we can run a linear regression:</p> <pre class="prettyprint"><code>> lm(Age ~ Education + Gender + Employment + Marital, data = impute.3) Call: lm(formula = Age ~ Education + Gender + Employment + Marital, data = impute.3) Coefficients: (Intercept) EducationNursery School to 8th Grade 51.6733 1.4100 Education9th Grade to 12th Grade, No Diploma EducationRegular High School Diploma 1.3675 0.7611 EducationGED or Alternative Credential EducationSome College, Less than 1 Year 1.0365 -2.6069 EducationSome College, 1 or More Years, No Degree EducationAssociate's Degree 0.3563 0.9506 EducationBachelor's Degree EducationMaster's Degree 1.2505 -1.6372 EducationProfessional School Degree EducationDoctorate Degree 1.1774 0.4936 GenderF EmploymentPart Time -0.3190 1.1316 EmploymentUnemployed EmploymentRetired 3.1622 -0.6855 EmploymentStudent MaritalDivorced 3.0850 0.2934 MaritalWidowed MaritalSeparated 2.3162 1.6833 MaritalNever Married 1.6169 </code></pre> <h3>MCMCRegress</h3> <pre class="prettyprint"><code>library(MCMCpack) # b0 = prior mean, B0 = prior precision = 1/variance fitBayes <- MCMCregress(Age ~ Education + Gender + Employment + Marital, data = impute.3, mcmc = 10000, seed = 1234, b0 = 0, B0 = 0.01, drop.unused.levels = TRUE) summary(fitBayes) </code></pre> <h3>Console Output</h3> <pre class="prettyprint"><code>> fitBayes <- MCMCregress(Age ~ Education + Gender + Employment + Marital, data = impute.3, mcmc = 10000, seed = 1234, b0 = 0, B0 = 0.01, drop.unused.levels = TRUE) > summary(fitBayes) Iterations = 1001:11000 Thinning interval = 1 Number of chains = 1 Sample size per chain = 10000 1. Empirical mean and standard deviation for each variable, plus standard error of the mean: Mean SD Naive SE Time-series SE (Intercept) 48.67377 2.5337 0.025337 0.025337 EducationNursery School to 8th Grade 3.77088 3.0514 0.030514 0.030514 Education9th Grade to 12th Grade, No Diploma 3.81009 2.7794 0.027794 0.027794 EducationRegular High School Diploma 3.24531 2.4933 0.024933 0.025412 EducationGED or Alternative Credential 3.38733 3.2155 0.032155 0.032155 EducationSome College, Less than 1 Year -0.08419 2.9104 0.029104 0.029577 EducationSome College, 1 or More Years, No Degree 2.82889 2.6092 0.026092 0.026092 EducationAssociate's Degree 3.32932 2.8410 0.028410 0.028410 EducationBachelor's Degree 3.72272 2.5228 0.025228 0.025659 EducationMaster's Degree 0.87738 2.8611 0.028611 0.028611 EducationProfessional School Degree 3.27542 4.0199 0.040199 0.040199 EducationDoctorate Degree 2.43794 4.5996 0.045996 0.045996 GenderF -0.11321 0.9327 0.009327 0.009327 EmploymentPart Time 1.25556 1.5756 0.015756 0.016170 EmploymentUnemployed 3.27395 1.6213 0.016213 0.015708 EmploymentRetired -0.52614 1.6394 0.016394 0.016394 EmploymentStudent 3.17027 1.6058 0.016058 0.016889 MaritalDivorced 0.72379 1.4715 0.014715 0.014715 MaritalWidowed 2.73130 1.4394 0.014394 0.014706 MaritalSeparated 2.10423 1.4608 0.014608 0.014608 MaritalNever Married 2.00781 1.4960 0.014960 0.014960 sigma2 448.01488 14.0715 0.140715 0.140715 2. Quantiles for each variable: 2.5% 25% 50% 75% 97.5% (Intercept) 43.75477 46.9556 48.6619 50.3967 53.609 EducationNursery School to 8th Grade -2.19290 1.7079 3.7701 5.8216 9.718 Education9th Grade to 12th Grade, No Diploma -1.59323 1.9586 3.8326 5.6676 9.349 EducationRegular High School Diploma -1.61001 1.5641 3.2474 4.9296 8.155 EducationGED or Alternative Credential -2.88523 1.2095 3.4173 5.5405 9.691 EducationSome College, Less than 1 Year -5.75364 -2.0617 -0.1009 1.8986 5.614 EducationSome College, 1 or More Years, No Degree -2.28754 1.0853 2.8608 4.5718 7.895 EducationAssociate's Degree -2.27611 1.4311 3.3285 5.2330 8.978 EducationBachelor's Degree -1.21780 2.0258 3.7275 5.4203 8.655 EducationMaster's Degree -4.61270 -1.0872 0.8601 2.8484 6.456 EducationProfessional School Degree -4.63027 0.5900 3.2767 5.9475 11.059 EducationDoctorate Degree -6.47767 -0.6371 2.4553 5.4188 11.705 GenderF -1.95673 -0.7298 -0.1067 0.4903 1.727 EmploymentPart Time -1.82784 0.1849 1.2597 2.3160 4.354 EmploymentUnemployed 0.09335 2.1988 3.2674 4.3557 6.433 EmploymentRetired -3.80162 -1.6316 -0.5147 0.5953 2.706 EmploymentStudent 0.03387 2.0713 3.1502 4.2227 6.342 MaritalDivorced -2.15073 -0.2732 0.7249 1.7266 3.602 MaritalWidowed -0.13488 1.7817 2.7367 3.6961 5.567 MaritalSeparated -0.76396 1.1177 2.1118 3.0700 5.001 MaritalNever Married -0.92230 0.9950 1.9976 3.0248 4.898 sigma2 420.98019 438.4621 447.7222 457.2730 476.481 </code></pre> <p>Hopefully, the above observations point in the right direction.</p> <h3>Citations:</h3> <ul> <li>R Package: Mice - Multivariate Imputation by Chained Equations Reference Manual By: Stef van Buuren</li> <li> Flexible Imputation of Missing Data By Stef van Buuren (<strong><em>Online Book</em></strong>)</li> <li> Practical Predictive Analytics by: Ralph Winters </li> <li> Simulation for Data Science with R by: Matthias Templ</li> <li> Bayesian Data Analysis, Third Edition, 3rd Edition By: Andrew Gelman; John B. Carlin; Hal S. Stern; David B. Dunson; Aki Vehtari; Donald B. Rubin</li> </ul>

Multiple Imputation of missing and censored data in R

Tags:

r

missing-data

imputation

I have a dataset with both missing-at-random (MAR) and censored data. The variables are correlated and I am trying to impute the missing data conditionally so that I can estimate the distribution parameters for a correlated multivariate normal distribution. I would like to use Gibbs MCMC, but am having difficulty implementing the procedure. My dataframe has 5 variables (denoted x1:x5), 1099 samples which contain some combination of MAR, censored and observed values. This is what I have tried so far:

# packages
library(msm, tmvtnorm, MCMCpack)

# priors 
theta0<-c(rep(0, 5))
Sigma0<-S0<-diag(5)  
nu0<-4 

# initialize parameters
theta<-c(rep(0, 5))
Tau<-diag(5) 

# initialize output matrix
n_samples <- 1000
mu_MCMC <- matrix(0, nrow = n_samples, ncol = 5)
mu_MCMC[1,] <- theta
cov_MCMC <- matrix(0, nrow = n_samples, ncol = 25)
cov_MCMC[1,] <- c(diag(5))

# detection limits
det_lim <- matrix(c(-1.7, 0, 0, 0, 0), nrow = 1, ncol = 5)

# function to detect NaN (i.e., below detection data)
is.nan.data.frame <- function(x)
    do.call(cbind, lapply(x, is.nan))

for(i in 2:n_samples){
    imputedDF <- data.frame()
    for(r in 1:nrow(originalDF)){
        # variables that are MAR or censored
        mis <- r[, is.na(r) & is.nan(r)]    
        # variables that are observed
        obs <- r[, !is.na(r)]

        # subset mu for missing, observed
        mu1 <- mu[, names(r) %in% names(mis)]
        mu2 <- mu[, names(r) %in% names(obs)]

        # calculate sigmas for MVN partitions of mis, obs
        sigma11 <- sigma[names(r) %in% names(mis), names(r) %in% names(mis)]
        sigma22 <- sigma[names(r) %in% names(obs), names(r) %in% names(obs)]
        sigma12 <- sigma[names(r) %in% names(obs), names(r) %in% names(mis)]
        sigma21 <- t(sigma12)

        # create matrix for detection limits based on missing values
        ## if NaN, use detection limit; if NA use Inf
        dl <- c(ifelse("x1" %in% names(is.nan(r)), det_lim[1, "x1"], Inf), 
                ifelse("x2" %in% names(is.nan(r)), det_lim[1, "x2"], Inf), 
                ifelse("x3" %in% names(is.nan(r)), det_lim[1, "x3"], Inf), 
                ifelse("x4" %in% names(is.nan(r)), det_lim[1, "x4"], Inf), 
                ifelse("x5" %in% names(is.nan(r)), det_lim[1, "x5"], Inf))

        # compute mu, sigma to use for conditional MVN
        ## if all values are missing
        if(length(names(obs) == 0) {
            mu_mis <- mu1
            sigma_mis <- sigma11
        ## otherwise
            } else {
                mu_mis <- mu1 + sigma12 %*% solve(sigma22) * (obs - t(mu2))
                sigma_mis <- sigma11 - sigma12 %*% solve(sigma22) %*% sigma21
        }

        # imputation
        ## if all data are observed, missing is empty
        if(length(obs) == 0) {
            mis_impute <- data.frame()
        ## only need to impute a single value
            } else if(length(names(mis)) == 1) {       
                  mis_impute <- rtnorm(1, mean = mu_mis, sd = sigma_mis, lower = -Inf, upper = dl)
        ## have more than one missing value         
                  } else {
                      mis_impute <- rtmvnorm(1, mean = mu_mis, sigma = sigma_mis, lower = rep(-Inf, length = length(names(mis))), upper = dl)
        }

       # merge observed values with simulated 
       ## if all values observed   
       if(length(names(mis)) == 0) {
           sim_result <- obs
           } else {
                 sim_result <- cbind(mis_impute, obs) 
       }

       imputedDF <- rbind(imputedDF, sim_result)
    }

    # update theta
    v <- solve(solve(Sigma0) + nrow(sim_result)*Tau)
    m <- v %*% (solve(Sigma0) %*% theta0 + Tau %*% apply(sim_result,2,sum))
    mu <- as.data.frame(rmvnorm(1,m,v))
    mu_MCMC[i,] <- mu

    # update Sigma
    tmp <- t(sim_result) - mu
    Tau <- rwish(nu0 + nrow(sim_result), solve(S0 + t(tmp) %*% tmp)) 
    sigma <- matrix(c(solve(Tau)), nrow = 5, ncol = 5, byrow = TRUE)
    cov_MCMC[i,] <- c(solve(Tau))
}

I keep running into errors because the imputation returns NaN and NA values, but I can't figure out what is going wrong because when I test it just using the inner loop to impute the data, it seems to work. Thus, the issue seems to be the parameter updating but I can't figure it out!

589

asked May 07 '17 03:05

chelsea

1 Answers

Preamble:

My sense is that part of the problem here is we do not have a good example dataset to work off.

My feeling is we can address this by creating an example dataset to frame the solution discussion. A useful package to this end is the Wakefield package that allows for the creation of simulated datasets.

We might, for example, create a dataset of 2000 people, where some of the ages, gender, employment status, education data, and marital status information is missing.

Imputation

The core question is can we impute the age or gender from other data in the data set?

For example, if we do not know someone's age, can we impute it from their marital status, employment type and or their Education level? At a very simplistic level, we might simply search for entries with NA for age, and look at Marital status. If the marital status is "married", then we impute that our data set is for American's and look up at the average age for marriage and replace with an estimated age for a married person.

We can expand on this and make our estimates more accurate by taking into account more variables. For example, we might look at both Marital Status, Education level and Employment status to further improve our age estimate. If a person is married, with a Ph.D., and retired we push the age upwards. If a person is single, a student we push the age lower. Further to this, we can look at the distribution of the ages in the data set to impute data about missing values.

Generate an Example Data Set.

# packages

requiredPackages <- c("wakefield", "dplyr", "BaylorEdPsych", "mice", "MCMCpack")

ipak <- function(pkg) {
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg)) {
    install.packages(new.pkg, dependencies = TRUE)
  }
  sapply(pkg, require, character.only = TRUE)
}

ipak(requiredPackages)

# generate some data for Males with a 8% missing value for
# age

set.seed(10)
f.df <- r_data_frame(n = 1000, age,
                    gender(x = c("M", "F"), prob = c(0, 1), name = "Gender"),
                    employment(x = c("Full Time", "Part Time", "Unemployed", "Retired", "Student"), prob = c(0.6, 0.1, 0.1, 0.1, 0.1), name = "Employment"),
                    education(x = c("No Schooling Completed", "Nursery School to 8th Grade", "9th Grade to 12th Grade, No Diploma",
                                    "Regular High School Diploma", "GED or Alternative Credential", "Some College, Less than 1 Year", "Some College, 1 or More Years, No Degree",
                                    "Associate's Degree", "Bachelor's Degree", "Master's Degree", "Professional School Degree", "Doctorate Degree"),
                                    prob = c(0.013, 0.05, 0.085, 0.246, 0.039, 0.064, 0.15, 0.075, 0.176, 0.072, 0.019, 0.012), name = "Education"),
                    marital(x = c("Married", "Divorced", "Widowed", "Separated", "Never Married"), prob = NULL, name = "Marital")) %>% r_na(cols = 1 - 3, prob = 0.05)

# str(f.df)
summary(f.df)
set.seed(20)

# generate some data for Females with a 5% missing value for
# age

m.df <- r_data_frame(n = 1000, age,
                     gender(x = c("M", "F"), prob = c(1, 0), name = "Gender"),
                     employment(x = c("Full Time", "Part Time", "Unemployed", "Retired", "Student"), prob = c(0.6, 0.1, 0.1, 0.1, 0.1), name = "Employment"),
                     education(x = c("No Schooling Completed", "Nursery School to 8th Grade", "9th Grade to 12th Grade, No Diploma",
                                      "Regular High School Diploma", "GED or Alternative Credential", "Some College, Less than 1 Year", "Some College, 1 or More Years, No Degree",
                                      "Associate's Degree", "Bachelor's Degree", "Master's Degree", "Professional School Degree", "Doctorate Degree"),
                                      prob = c(0.013,0.05, 0.085, 0.246, 0.039, 0.064, 0.15, 0.075, 0.176, 0.072,0.019, 0.012), name = "Education"),
                     marital(x = c("Married", "Divorced", "Widowed", "Separated", "Never Married"), prob = NULL, name = "Marital")) %>% r_na(cols = 1 - 3, prob = 0.03)

summary(m.df)

all.df = rbind.data.frame(m.df, f.df)

summary(all.df)

Data Summary

> summary(all.df)
      Age        Gender        Employment                                      Education            Marital   
 Min.   :18.00   M:1000   Full Time :1142   Regular High School Diploma             :459   Married      :394  
 1st Qu.:35.00   F:1000   Part Time : 207   Bachelor's Degree                       :356   Divorced     :378  
 Median :54.00            Unemployed: 193   Some College, 1 or More Years, No Degree:284   Widowed      :411  
 Mean   :53.76            Retired   : 182   9th Grade to 12th Grade, No Diploma     :156   Separated    :379  
 3rd Qu.:72.00            Student   : 196   Associate's Degree                      :145   Never Married:358  
 Max.   :89.00            NA's      :  80   (Other)                                 :520   NA's         : 80  
 NA's   :80                                 NA's                                    : 80                      
>

Is data Missing completely at Random or Not Missing at Random?

# Test for MCAR - Missing at Completely at Random...
test_mcar <- LittleMCAR(all.df)
print(test_mcar$amount.missing)
print(test_mcar$p.value)

Console Output

> # Test for MCAR - Missing at Completely at Random...
> test_mcar <- LittleMCAR(all.df)
this could take a while> print(test_mcar$amount.missing)
                  Age Gender Employment Education Marital
Number Missing  80.00      0      80.00     80.00   80.00
Percent Missing  0.04      0       0.04      0.04    0.04
> print(test_mcar$p.value)
[1] 0.02661428

Imputation of Data

Ok, let us first look at the distribution of missing values. We can run mice::md.pattern() function, to show the distribution of the missing values over the other columns in the dataframe. The md.pattern() function output is useful for suggesting which variables might be good candidates to use for imputing the missing values:

> md.pattern(all.df)
     Gender Age Employment Education Marital    
1696      1   1          1         1       1   0
73        1   1          1         1       0   1
73        1   1          1         0       1   1
2         1   1          1         0       0   2
71        1   1          0         1       1   1
3         1   1          0         1       0   2
2         1   1          0         0       1   2
71        1   0          1         1       1   1
2         1   0          1         1       0   2
3         1   0          1         0       1   2
4         1   0          0         1       1   2
          0  80         80        80      80 320

Ok, from this we can now move to impute the missing values:

imp <- mice(all.df, m = 5, maxit = 50, seed = 1234, printFlag = FALSE)

The m=5 parameter specifies that you end up with five plausible imputations for the variable
The maxit=50 parameter specifies that there will be up to 50 iterations of the algorithm before it converges to a solution and this can be adjusted upward or downward to the desired precision

The mice() function may take a while depending upon the number of iterations we specify. In this case, upon completion we can see some of the imputed values for Age using head() function:

head(imp$imp$Age)
     1  2  3  4  5
7   28 49 37 70 89
33  55 54 52 88 24
56  89 83 68 71 61
84  43 43 24 30 31
96  28 64 89 41 50
120 47 34 36 22 77

To actually complete the imputation, we have to run the complete() function and assign the results to a new dataframe. This version of complete() function will collect all imputations in the assigned dataframe via the "long" parameter:

all_imputed_df <- complete(imp, "long", include = TRUE)
table(all_imputed_df$.imp, is.na(all_imputed_df$Age))

Console:

> all_imputed_df <- complete(imp, "long", include = TRUE)
> table(all_imputed_df$.imp, is.na(all_imputed_df$Age))

    FALSE TRUE
  0  1920   80
  1  2000    0
  2  2000    0
  3  2000    0
  4  2000    0
  5  2000    0

Now we have a dataset of 12000 age values, across 5 age inputted values.

Let's try a regression with imputation #3.

First, extract impute #3

impute.3 <- subset(all_imputed_df,.imp=='3')  
summary(impute.3)

Console:

> impute.3 <- subset(all_imputed_df, .imp == "3")
> summary(impute.3)
      .imp        .id              Age        Gender        Employment  
 Min.   :3   Min.   :   1.0   Min.   :18.00   M:1000   Full Time :1192  
 1st Qu.:3   1st Qu.: 500.8   1st Qu.:35.00   F:1000   Part Time : 211  
 Median :3   Median :1000.5   Median :54.00            Unemployed: 202  
 Mean   :3   Mean   :1000.5   Mean   :53.89            Retired   : 191  
 3rd Qu.:3   3rd Qu.:1500.2   3rd Qu.:72.00            Student   : 204  
 Max.   :3   Max.   :2000.0   Max.   :89.00                             

                                    Education            Marital   
 Regular High School Diploma             :478   Married      :416  
 Bachelor's Degree                       :376   Divorced     :390  
 Some College, 1 or More Years, No Degree:295   Widowed      :425  
 9th Grade to 12th Grade, No Diploma     :168   Separated    :393  
 Associate's Degree                      :150   Never Married:376  
 Master's Degree                         :141                      
 (Other)                                 :392

Now we can run a linear regression:

> lm(Age ~ Education + Gender + Employment + Marital, data = impute.3)

Call:
lm(formula = Age ~ Education + Gender + Employment + Marital, 
    data = impute.3)

Coefficients:
                                      (Intercept)               EducationNursery School to 8th Grade  
                                          51.6733                                             1.4100  
     Education9th Grade to 12th Grade, No Diploma               EducationRegular High School Diploma  
                                           1.3675                                             0.7611  
           EducationGED or Alternative Credential            EducationSome College, Less than 1 Year  
                                           1.0365                                            -2.6069  
EducationSome College, 1 or More Years, No Degree                        EducationAssociate's Degree  
                                           0.3563                                             0.9506  
                       EducationBachelor's Degree                           EducationMaster's Degree  
                                           1.2505                                            -1.6372  
              EducationProfessional School Degree                          EducationDoctorate Degree  
                                           1.1774                                             0.4936  
                                          GenderF                                EmploymentPart Time  
                                          -0.3190                                             1.1316  
                             EmploymentUnemployed                                  EmploymentRetired  
                                           3.1622                                            -0.6855  
                                EmploymentStudent                                    MaritalDivorced  
                                           3.0850                                             0.2934  
                                   MaritalWidowed                                   MaritalSeparated  
                                           2.3162                                             1.6833  
                             MaritalNever Married  
                                           1.6169

MCMCRegress

library(MCMCpack) # b0 = prior mean, B0 = prior precision = 1/variance
fitBayes <- MCMCregress(Age ~ Education + Gender + Employment + Marital, data = impute.3, mcmc = 10000, seed = 1234, b0 = 0, B0 = 0.01, drop.unused.levels = TRUE)
summary(fitBayes)

Console Output

> fitBayes <- MCMCregress(Age ~ Education + Gender + Employment + Marital, data = impute.3, mcmc = 10000, seed = 1234, b0 = 0, B0 = 0.01, drop.unused.levels = TRUE)
> summary(fitBayes)

Iterations = 1001:11000
Thinning interval = 1 
Number of chains = 1 
Sample size per chain = 10000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

                                                       Mean      SD Naive SE Time-series SE
(Intercept)                                        48.67377  2.5337 0.025337       0.025337
EducationNursery School to 8th Grade                3.77088  3.0514 0.030514       0.030514
Education9th Grade to 12th Grade, No Diploma        3.81009  2.7794 0.027794       0.027794
EducationRegular High School Diploma                3.24531  2.4933 0.024933       0.025412
EducationGED or Alternative Credential              3.38733  3.2155 0.032155       0.032155
EducationSome College, Less than 1 Year            -0.08419  2.9104 0.029104       0.029577
EducationSome College, 1 or More Years, No Degree   2.82889  2.6092 0.026092       0.026092
EducationAssociate's Degree                         3.32932  2.8410 0.028410       0.028410
EducationBachelor's Degree                          3.72272  2.5228 0.025228       0.025659
EducationMaster's Degree                            0.87738  2.8611 0.028611       0.028611
EducationProfessional School Degree                 3.27542  4.0199 0.040199       0.040199
EducationDoctorate Degree                           2.43794  4.5996 0.045996       0.045996
GenderF                                            -0.11321  0.9327 0.009327       0.009327
EmploymentPart Time                                 1.25556  1.5756 0.015756       0.016170
EmploymentUnemployed                                3.27395  1.6213 0.016213       0.015708
EmploymentRetired                                  -0.52614  1.6394 0.016394       0.016394
EmploymentStudent                                   3.17027  1.6058 0.016058       0.016889
MaritalDivorced                                     0.72379  1.4715 0.014715       0.014715
MaritalWidowed                                      2.73130  1.4394 0.014394       0.014706
MaritalSeparated                                    2.10423  1.4608 0.014608       0.014608
MaritalNever Married                                2.00781  1.4960 0.014960       0.014960
sigma2                                            448.01488 14.0715 0.140715       0.140715

2. Quantiles for each variable:

                                                       2.5%      25%      50%      75%   97.5%
(Intercept)                                        43.75477  46.9556  48.6619  50.3967  53.609
EducationNursery School to 8th Grade               -2.19290   1.7079   3.7701   5.8216   9.718
Education9th Grade to 12th Grade, No Diploma       -1.59323   1.9586   3.8326   5.6676   9.349
EducationRegular High School Diploma               -1.61001   1.5641   3.2474   4.9296   8.155
EducationGED or Alternative Credential             -2.88523   1.2095   3.4173   5.5405   9.691
EducationSome College, Less than 1 Year            -5.75364  -2.0617  -0.1009   1.8986   5.614
EducationSome College, 1 or More Years, No Degree  -2.28754   1.0853   2.8608   4.5718   7.895
EducationAssociate's Degree                        -2.27611   1.4311   3.3285   5.2330   8.978
EducationBachelor's Degree                         -1.21780   2.0258   3.7275   5.4203   8.655
EducationMaster's Degree                           -4.61270  -1.0872   0.8601   2.8484   6.456
EducationProfessional School Degree                -4.63027   0.5900   3.2767   5.9475  11.059
EducationDoctorate Degree                          -6.47767  -0.6371   2.4553   5.4188  11.705
GenderF                                            -1.95673  -0.7298  -0.1067   0.4903   1.727
EmploymentPart Time                                -1.82784   0.1849   1.2597   2.3160   4.354
EmploymentUnemployed                                0.09335   2.1988   3.2674   4.3557   6.433
EmploymentRetired                                  -3.80162  -1.6316  -0.5147   0.5953   2.706
EmploymentStudent                                   0.03387   2.0713   3.1502   4.2227   6.342
MaritalDivorced                                    -2.15073  -0.2732   0.7249   1.7266   3.602
MaritalWidowed                                     -0.13488   1.7817   2.7367   3.6961   5.567
MaritalSeparated                                   -0.76396   1.1177   2.1118   3.0700   5.001
MaritalNever Married                               -0.92230   0.9950   1.9976   3.0248   4.898
sigma2                                            420.98019 438.4621 447.7222 457.2730 476.481

Hopefully, the above observations point in the right direction.

Citations:

R Package: Mice - Multivariate Imputation by Chained Equations Reference Manual By: Stef van Buuren
Flexible Imputation of Missing Data By Stef van Buuren (Online Book)
Practical Predictive Analytics by: Ralph Winters
Simulation for Data Science with R by: Matthias Templ
Bayesian Data Analysis, Third Edition, 3rd Edition By: Andrew Gelman; John B. Carlin; Hal S. Stern; David B. Dunson; Aki Vehtari; Donald B. Rubin

139

answered Nov 02 '22 20:11

Technophobe01

Related questions
                            
                                Figure name in caption using RMarkdown
                            
                                creating sequence of dates for each group in r
                            
                                How to prep transaction data into basket for arules
                            
                                Sankey diagram in R
                            
                                Setting NA in a matrix using another logical matrix
                            
                                horizontal ggplot2::geom_violin without coord_flip
                            
                                Creating internal functions (can't be called from console) in R
                            
                                Why does data.table get copied when adding a new column?
                            
                                XGBoost - Poisson distribution with varying exposure / offset
                            
                                Differing quantiles: Boxplot vs. Violinplot
                            
                                Draw curved lines in ggmap, geom_curve not working
                            
                                R, R6 Operator Overloading
                            
                                Stochastic gradient descent from gradient descent implementation in R
                            
                                How to make label and box align next to each other in shiny::numericInput?
                            
                                Drawing decision boundaries in R
                            
                                cv.glmnet fails for ridge, not lasso, for simulated data with coder error
                            
                                Error: processing vignette failed with diagnostics: 4 simultaneous processes spawned
                            
                                Crop out ggplot2 whitespace around plot
                            
                                What is the difference between list.files and dir?
                            
                                Export Graphics with exact size and font size

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Multiple Imputation of missing and censored data in R

Tags:

r

missing-data

imputation

chelsea

People also ask

1 Answers

Preamble:

Imputation

Generate an Example Data Set.

Data Summary

Is data Missing completely at Random or Not Missing at Random?

Console Output

Imputation of Data

Console:

Console:

MCMCRegress

Console Output

Citations:

Technophobe01

Recent Activity

Donate For Us