Remove outliers from correlation coefficient calculation

Assume we have two numeric vectors x and y. The Pearson correlation coefficient between x and y is given by

cor(x, y)

How can I automatically consider only a subset of x and y in the calculation (say 90%) so as to maximize the correlation coefficient?
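For reference, Pearson's r is cov(x, y) / (sd(x) * sd(y)); a minimal check in R, using made-up vectors (any numeric x and y of equal length will do):

x <- c(1, 2, 3, 4, 10)        ## made-up data; the last point is an outlier
y <- c(2, 4, 6, 8, 9)
cor(x, y)                     ## Pearson by default
cov(x, y) / (sd(x) * sd(y))   ## same value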

Leo asked Jan 12 '11

People also ask

How do you calculate outliers in a correlation coefficient?

An outlier will typically weaken the correlation, making the data more scattered, so r gets closer to 0. Therefore, if you remove the outlier, the r value will increase (a stronger correlation, since the data will be less scattered).

Should you remove outliers for correlation?

It's best to remove outliers only when you have a sound reason for doing so. Some outliers represent natural variations in the population, and they should be left as is in your dataset. These are called true outliers.

Can you calculate correlation with outliers?

The method most commonly used to estimate the correlation between two datasets is to calculate the correlation coefficient from the values in the two data sets, but it is more robust against outliers to calculate it from the ranks of the data.

What do you do with outliers in a Pearson correlation?

A basic assumption in correlation is that there are not many outliers. In my opinion you can follow two routes: (a) detect outliers using casewise diagnostics and delete them (but you have to justify this), or (b) use a non-parametric alternative (Spearman's rank-order correlation or Kendall's tau correlation); see the short sketch below.
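Both non-parametric alternatives are available directly through cor()'s method argument; a minimal sketch, assuming numeric vectors x and y as in the question:

cor(x, y, method = "spearman")  ## Spearman's rank-order correlation (rho)
cor(x, y, method = "kendall")   ## Kendall's tau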


1 Answer

If you really want to do this (remove the observations with the most extreme residuals), then we can use a linear model to estimate the least-squares fit and its residuals, and then select the middle n% of the data. Here is an example:

Firstly, generate some dummy data:

require(MASS) ## for mvrnorm()
set.seed(1)
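## Note: this Sigma is not symmetric as written; mvrnorm()'s internal
## eigen(..., symmetric = TRUE) uses only the lower triangle, so the
## effective covariance matrix is matrix(c(1, 0.8, 0.8, 0.8), ncol = 2),
## i.e. a correlation of ~0.89 (which matches the cor() output below).
## For a correlation of exactly 0.8, pass
## Sigma = matrix(c(1, 0.8, 0.8, 1), ncol = 2) instead.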
dat <- mvrnorm(1000, mu = c(4,5), Sigma = matrix(c(1,0.8,1,0.8), ncol = 2))
dat <- data.frame(dat)
names(dat) <- c("X","Y")
plot(dat)

Next, we fit the linear model and extract the residuals:

res <- resid(mod <- lm(Y ~ X, data = dat))

The quantile() function can give us the required quantiles of the residuals. You suggested retaining 90% of the data, so we want the 0.05 and 0.95 quantiles:

res.qt <- quantile(res, probs = c(0.05,0.95))

Select those observations with residuals in the middle 90% of the data:

want <- which(res >= res.qt[1] & res <= res.qt[2])
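As a quick sanity check, the fraction of observations retained should be close to 0.9:

length(want) / nrow(dat)  ## proportion of the data kept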

We can then visualise this, with the red points being those we will retain:

plot(dat, type = "n")
points(dat[-want,], col = "black", pch = 21, bg = "black", cex = 0.8)
points(dat[want,], col = "red", pch = 21, bg = "red", cex = 0.8)
abline(mod, col = "blue", lwd = 2)

[Plot: the dummy data, with the retained points (middle 90% of residuals) in red, the discarded points in black, and the fitted line in blue]

The correlations for the full data and the selected subset are:

> cor(dat)
          X         Y
X 1.0000000 0.8935235
Y 0.8935235 1.0000000
> cor(dat[want,])
          X         Y
X 1.0000000 0.9272109
Y 0.9272109 1.0000000
> cor(dat[-want,])
         X        Y
X 1.000000 0.739972
Y 0.739972 1.000000

Be aware that here we might be throwing out perfectly good data, because we simply discard the 5% of observations with the largest positive residuals and the 5% with the largest negative residuals. An alternative is to select the 90% with the smallest absolute residuals:

ares <- abs(res)
absres.qt <- quantile(ares, probs = 0.9)
abswant <- which(ares <= absres.qt)
## plot - virtually the same, but not quite
plot(dat, type = "n")
points(dat[-abswant,], col = "black", pch = 21, bg = "black", cex = 0.8)
points(dat[abswant,], col = "red", pch = 21, bg = "red", cex = 0.8)
abline(mod, col = "blue", lwd = 2)
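To quantify how much the two subsets actually differ, we can count the observations they share:

length(intersect(want, abswant))  ## observations common to both 90% subsets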

With this slightly different subset, the correlation is slightly lower:

> cor(dat[abswant,])
          X         Y
X 1.0000000 0.9272032
Y 0.9272032 1.0000000

Another point is that even then we are throwing out good data. You might want to look at Cook's distance as a measure of the influence of each observation, and discard only those observations whose Cook's distance exceeds a certain threshold. Wikipedia has info on Cook's distance and proposed thresholds. The cooks.distance() function can be used to retrieve the values from mod:

> head(cooks.distance(mod))
           1            2            3            4            5            6 
7.738789e-04 6.056810e-04 6.375505e-04 4.338566e-04 1.163721e-05 1.740565e-03

You could then compute the threshold(s) suggested on Wikipedia and remove only those observations that exceed them. For these data, none of the Cook's distances exceed the conventional cut-off of 1:

> any(cooks.distance(mod) > 1)
[1] FALSE

(not surprising, given the way I generated the data). A stricter rule of thumb suggested there flags observations with a Cook's distance greater than 4/n, which for these data is 4/1000 = 0.004.
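A minimal sketch of applying that 4/n rule of thumb and recomputing the correlation (cd and keep.cd are just illustrative names):

cd <- cooks.distance(mod)
keep.cd <- cd <= 4 / nrow(dat)  ## TRUE for observations below the 4/n cut-off
sum(!keep.cd)                   ## how many observations the rule flags
cor(dat[keep.cd, ])             ## correlation with the flagged points dropped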

Having said all of this, why do you want to do this? If you are just trying to get rid of data to improve a correlation or to generate a significant relationship, that sounds a bit fishy, and a bit like data dredging, to me.

Gavin Simpson answered Sep 21 '22