Impute missing data, while forcing correlation coefficient to remain the same

Consider the following (Excel) dataset:

m   |   r
----|------
2.0 | 3.3
0.8 |   
    | 4.0
1.3 |   
2.1 | 5.2
    | 2.3
    | 1.9
2.5 | 
1.2 | 3.0
2.0 | 2.6

My goal is to fill in missing values using the following condition:

Denote by R the pairwise correlation between the two columns above (around 0.68), and by R* the correlation after the empty cells have been filled in. Fill in the table so that (R - R*)^2 = 0. That is, I want to keep the correlation structure of the data intact.
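Concretely, R here is the sample Pearson correlation over the n rows in which both m and r are observed, and R* is the same expression evaluated over all rows once the blanks are filled in:

$$
R = \frac{n \sum_i m_i r_i - \sum_i m_i \sum_i r_i}{\sqrt{n \sum_i m_i^2 - \left(\sum_i m_i\right)^2} \, \sqrt{n \sum_i r_i^2 - \left(\sum_i r_i\right)^2}}
$$

This is the quantity that the Matlab function below builds by hand.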

So far I have done it using Matlab:

clear all;

m = xlsread('data.xlsx','A2:A11') ;
r = xlsread('data.xlsx','B2:B11') ;

rho = corr(m,r,'rows','pairwise');

x0 = [1,1,1,1,1,1];   % initial guess for the six missing values
lb = [0,0,0,0,0,0];   % lower bounds on the imputed values
f = @(x)my_correl(x,rho);

SOL = fmincon(f,x0,[],[],[],[],lb)

where the function my_correl is:

function X = my_correl(x,rho)
% x(1:3) are the missing m values (rows where r = 4.0, 2.3, 1.9)
% x(4:6) are the missing r values (rows where m = 0.8, 1.3, 2.5)
% Hard-coded constants are sums over the observed data; n = 10 once all cells are filled.

sum_m = 11.9 + x(1) + x(2) + x(3);
sum_r = 22.3 + x(4) + x(5) + x(6);
rho_num = 10*(26.32 + 4*x(1) + 2.3*x(2) + 1.9*x(3) + 0.8*x(4) + 1.3*x(5) + 2.5*x(6)) - sum_m*sum_r;
rho_den = sqrt(10*(22.43 + x(1)^2 + x(2)^2 + x(3)^2) - sum_m^2)*sqrt(10*(78.6 + x(4)^2 + x(5)^2 + x(6)^2) - sum_r^2);

X = (rho - rho_num/rho_den)^2;

end

This function computes the correlation manually, with each missing value treated as a variable x(i).
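For reference, the hard-coded constants in my_correl are just sums over the observed values of the toy table; a quick check in R (one of the alternative languages mentioned in Note 1 below):

m <- c(2.0, 0.8, NA, 1.3, 2.1, NA, NA, 2.5, 1.2, 2.0)
r <- c(3.3, NA, 4.0, NA, 5.2, 2.3, 1.9, NA, 3.0, 2.6)
sum(m, na.rm = TRUE)      # 11.9  -> constant in sum_m
sum(r, na.rm = TRUE)      # 22.3  -> constant in sum_r
sum(m * r, na.rm = TRUE)  # 26.32 -> cross-product constant in rho_num (complete pairs only)
sum(m^2, na.rm = TRUE)    # 22.43 -> sum-of-squares constant for m in rho_den
sum(r^2, na.rm = TRUE)    # 78.59 -> sum-of-squares constant for r in rho_den (~78.6)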

The problem: my actual dataset has more than 20,000 observations. There is no way I can create that rho formula manually.

How can I fill in my dataset?

Note 1: I am open to using alternative languages like Python, Julia, or R; Matlab is just my default.

Note 2: a 100-point bounty will be awarded to the answer. Promised as of now.

asked Sep 21 '16 13:09 by luchonacho


1 Answer

This is how I would approach it, with an implementation in R provided:

There is no unique solution for imputing the missing data points such that the pairwise correlation of the complete (imputed) data equals the pairwise correlation of the incomplete data. So, to find a 'good' solution rather than just 'any' solution, we can introduce an additional criterion: the complete imputed data should also share the same linear regression as the original data. This leads to a fairly simple approach:

  1. Calculate a linear regression model for the original data.
  2. Find imputed values for the missing entries that lie exactly on this regression line.
  3. Generate a random scatter of residuals for the imputed values around this regression line.
  4. Scale the imputed residuals to force the correlation of the complete imputed data to equal that of the original data.

Here is this approach implemented in R:

library(data.table)
set.seed(123)

# dt is the data table defined in the DATA section at the end of this answer
rho = cor(dt$m,dt$r,'pairwise')

# calculate linear regression of original data
fit1 = lm(r ~ m, data=dt)
fit2 = lm(m ~ r, data=dt)
# extract the standard errors of regression intercept (in each m & r direction)
# and multiply s.e. by sqrt(n) to get standard deviation 
sd1 = summary(fit1)$coefficients[1,2] * sqrt(dt[!is.na(r), .N])
sd2 = summary(fit2)$coefficients[1,2] * sqrt(dt[!is.na(m), .N])

# find where data points with missing values lie on the regression line
dt[is.na(r), r.imp := coefficients(fit1)[1] + coefficients(fit1)[2] * m] 
dt[is.na(m), m.imp := coefficients(fit2)[1] + coefficients(fit2)[2] * r]

# generate randomised residuals for the missing data, using the s.d. calculated above
dt[is.na(r), r.ran := rnorm(.N, sd=sd1)] 
dt[is.na(m), m.ran := rnorm(.N, sd=sd2)] 

# function that scales the residuals by a factor x, then calculates how close correlation of imputed data is to that of original data
obj = function(x, dt, rho) {
  dt[, r.comp := r][, m.comp := m]
  dt[is.na(r), r.comp := r.imp + r.ran*x] 
  dt[is.na(m), m.comp := m.imp + m.ran*x] 
  rho2 = cor(dt$m.comp, dt$r.comp,'pairwise')
  (rho-rho2)^2
}

# find the value of x that minimises the discrepancy between the imputed and original correlation
fit = optimize(obj, c(-5,5), dt, rho)

x=fit$minimum
dt[, r.comp := r][, m.comp := m]
dt[is.na(r), r.comp := r.imp + r.ran*x] 
dt[is.na(m), m.comp := m.imp + m.ran*x] 
rho2 = cor(dt$m.comp, dt$r.comp,'pairwise')
(rho-rho2)^2  # check that rho2 is approximately equal to rho

As a final check, calculate the linear regression of the complete imputed data and plot it to show that the regression line is the same as for the original data. Note that the plot below was generated using the large data set given in the DATA section, to demonstrate the use of this method on large data.

fit.comp = lm(r.comp ~ m.comp, data=dt)
plot(dt$m.comp, dt$r.comp)
points(dt$m, dt$r, col="red")
abline(fit1, col="green")
abline(fit.comp, col="blue")
mtext(paste(" Rho =", round(rho,5)), at=-1)
mtext(paste(" Rho2 =", round(rho2, 5)), at=6)

[Plot: complete imputed data (black) with the original data overlaid (red); the original regression line (green) and the imputed-data regression line (blue) coincide, with Rho and Rho2 annotated.]

DATA

Original toy data from the OP's example:

dt=structure(list(m = c(2, 0.8, NA, 1.3, 2.1, NA, NA, 2.5, 1.2, 2), 
                  r = c(3.3, NA, 4, NA, 5.2, 2.3, 1.9, NA, 3, 2.6)), 
             .Names = c("m", "r"), row.names = c(NA, -10L), 
             class = c("data.table", "data.frame"))
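Equivalently, the same toy table can be constructed directly with the data.table() constructor; this is just an alternative to the structure() call above:

dt = data.table(m = c(2, 0.8, NA, 1.3, 2.1, NA, NA, 2.5, 1.2, 2),
                r = c(3.3, NA, 4, NA, 5.2, 2.3, 1.9, NA, 3, 2.6))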

A larger data set to demonstrate the method on big data:

dt = data.table(m=rnorm(1e5, 3, 2))[, r:=1.5 + 1.1*m + rnorm(1e5,0,2)]
dt[sample(.N, 3e4), m:=NA]
dt[sample(which(!is.na(m)), 3e4), r := NA]
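After building either version of dt, the answer code above can be re-run from the rho = cor(...) line onward to impute the missing values and verify the correlation.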
answered Sep 21 '22 16:09 by dww