Consider the following (excel) dataset:
m | r
----|------
2.0 | 3.3
0.8 |
| 4.0
1.3 |
2.1 | 5.2
| 2.3
| 1.9
2.5 |
1.2 | 3.0
2.0 | 2.6
My goal is to fill in missing values using the following condition:
Denote as R the pairwise correlation between the above two columns (around 0.68). Denote as R* the correlation after the empty cells have been filled in. Fill in the table so that (R - R*)^2 = 0. This is, I want to keep the correlation structure of the data intact.
So far I have done it using Matlab:
clear all;
m = xlsread('data.xlsx','A2:A11') ;
r = xlsread('data.xlsx','B2:B11') ;
rho = corr(m,r,'rows','pairwise');
x0 = [1,1,1,1,1,1];
lb = [0,0,0,0,0,0];
f = @(x)my_correl(x,rho);
SOL = fmincon(f,x0,[],[],[],[],lb)
where the function my_correl
is:
function X = my_correl(x,rho)
sum_m = (11.9 + x(1) + x(2) + x(3));
sum_r = (22.3 + x(1) + x(2) + x(3));
avg_m = (11.9 + x(1) + x(2) + x(3))/8;
avg_r = (22.3 + x(4) + x(5) + x(6))/8;
rho_num = 8*(26.32 + 4*x(1) + 2.3*x(2) + 1.9*x(3) + 0.8*x(4) + 1.3*x(5) + 2.5*x(6)) - sum_m*sum_r;
rho_den = sqrt(8*(22.43 + (4*x(1))^2 + (2.3*x(2))^2 + (1.9*x(3))^2) - sum_m^2)*sqrt(8*(78.6 + (0.8*x(4))^2 + (1.3*x(5))^ + (2.5*x(6))^2) - sum_r^2);
X = (rho - rho_num/rho_den)^2;
end
This function computes the correlation manually, where every missing data is a variable x(i)
.
The problem: my actual dataset has more than 20,000 observations. There is no way I can create that rho formula manually.
How can I fill in my dataset?
Note 1: I am open to use alternative languages like Python, Julia, or R. Matlab it's just my default one.
Note 2: a 100 points bounty will be awarded to the answer. Promise from now.
Mean imputation. Perhaps the easiest way to impute is to replace each missing value with the mean of the observed values for that variable.
If some data are missing, it is not possible to assess the correlation in the usual way.
The simplest imputation method is replacing missing values with the mean or median values of the dataset at large, or some similar summary statistic. This has the advantage of being the simplest possible approach, and one that doesn't introduce any undue bias into the dataset.
Problem #1: Mean imputation does not preserve the relationships among variables. True, imputing the mean preserves the mean of the observed data. So if the data are missing completely at random, the estimate of the mean remains unbiased.
This is how I would approach it, with an implementation in R provided:
There is not a unique solution for imputing the missing data points, such that the pairwise correlation of the complete (imputed) data is equal to the pairwise correlation of the incomplete data. So to find a 'good' solution rather than just 'any' solution, we can introduce an additional criteria that the complete imputed data should also share the same linear regression with the original data. This leads us to a fairly simple approach.
A solution like this in R:
library(data.table)
set.seed(123)
rho = cor(dt$m,dt$r,'pairwise')
# calculate linear regression of original data
fit1 = lm(r ~ m, data=dt)
fit2 = lm(m ~ r, data=dt)
# extract the standard errors of regression intercept (in each m & r direction)
# and multiply s.e. by sqrt(n) to get standard deviation
sd1 = summary(fit1)$coefficients[1,2] * sqrt(dt[!is.na(r), .N])
sd2 = summary(fit2)$coefficients[1,2] * sqrt(dt[!is.na(m), .N])
# find where data points with missing values lie on the regression line
dt[is.na(r), r.imp := coefficients(fit1)[1] + coefficients(fit1)[2] * m]
dt[is.na(m), m.imp := coefficients(fit2)[1] + coefficients(fit2)[2] * r]
# generate randomised residuals for the missing data, using the s.d. calculated above
dt[is.na(r), r.ran := rnorm(.N, sd=sd1)]
dt[is.na(m), m.ran := rnorm(.N, sd=sd2)]
# function that scales the residuals by a factor x, then calculates how close correlation of imputed data is to that of original data
obj = function(x, dt, rho) {
dt[, r.comp := r][, m.comp := m]
dt[is.na(r), r.comp := r.imp + r.ran*x]
dt[is.na(m), m.comp := m.imp + m.ran*x]
rho2 = cor(dt$m.comp, dt$r.comp,'pairwise')
(rho-rho2)^2
}
# find the value of x that minimises the discrepencay of imputed versus original correlation
fit = optimize(obj, c(-5,5), dt, rho)
x=fit$minimum
dt[, r.comp := r][, m.comp := m]
dt[is.na(r), r.comp := r.imp + r.ran*x]
dt[is.na(m), m.comp := m.imp + m.ran*x]
rho2 = cor(dt$m.comp, dt$r.comp,'pairwise')
(rho-rho2)^2 # check that rho2 is approximately equal to rho
As a final check, calculate linear regression of the complete imputed data and plot to show that regression line is same as for original data. Note that the plot below was for the large data set shown below, to demonstrate use of this method on large data.
fit.comp = lm(r.comp ~ m.comp, data=dt)
plot(dt$m.comp, dt$r.comp)
points(dt$m, dt$r, col="red")
abline(fit1, col="green")
abline(fit.comp, col="blue")
mtext(paste(" Rho =", round(rho,5)), at=-1)
mtext(paste(" Rho2 =", round(rho2, 5)), at=6)
DATA
original toy data from OP example:
dt=structure(list(m = c(2, 0.8, NA, 1.3, 2.1, NA, NA, 2.5, 1.2, 2),
r = c(3.3, NA, 4, NA, 5.2, 2.3, 1.9, NA, 3, 2.6)),
.Names = c("m", "r"), row.names = c(NA, -10L),
class = c("data.table", "data.frame"))
A larger data set to demonstrate on big data
dt = data.table(m=rnorm(1e5, 3, 2))[, r:=1.5 + 1.1*m + rnorm(1e5,0,2)]
dt[sample(.N, 3e4), m:=NA]
dt[sample(which(!is.na(m)), 3e4), r := NA]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With