Assume we have two numeric vectors x and y. The Pearson correlation coefficient between x and y is given by
cor(x, y)
How can I automatically consider only a subset of x and y in the calculation (say 90%) so as to maximize the correlation coefficient?
An outlier will weaken the correlation, making the data more scattered, so r gets closer to 0. Therefore, if you remove the outlier, the r value will increase (a stronger correlation, since the data will be less scattered).
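To see this concretely, here is a minimal sketch with invented data: a perfect linear relationship, then the same data with a single outlying point added:

```r
## Perfectly correlated toy data
x <- 1:10
y <- 1:10
cor(x, y)            # exactly 1

## Add one point far off the trend
x_out <- c(x, 100)
y_out <- c(y, 1)
cor(x_out, y_out)    # much weaker (here it even turns negative)
```

A single point can dominate both the covariance and the standard deviations, which is why Pearson's r is so sensitive to outliers.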
It's best to remove outliers only when you have a sound reason for doing so. Some outliers represent natural variations in the population, and they should be left as is in your dataset. These are called true outliers.
Correlation when there are outliers in the data: the method most commonly used to estimate the correlation between two datasets is to calculate the correlation coefficient based on the raw values in the two datasets. But it is more robust against outliers to calculate it based on the ranks of the data.
A basic assumption in correlation is that there are not many outliers. In my opinion you can follow two routes: (a) detect outliers using "casewise diagnostics" and delete them (but you have to justify this), or (b) use a non-parametric alternative (Spearman's rank-order correlation or Kendall's tau).
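Both rank-based alternatives are available directly through the method argument of R's cor(); a small sketch with simulated data (the outlier placement is invented for illustration):

```r
set.seed(42)
x <- rnorm(50)
y <- x + rnorm(50, sd = 0.3)   # strong linear relationship
y[which.min(x)] <- 20          # plant one gross outlier at the smallest x

cor(x, y)                        # Pearson: badly damaged by the single outlier
cor(x, y, method = "spearman")   # rank-based: only mildly affected
cor(x, y, method = "kendall")    # rank-based: only mildly affected
```

Because the rank-based measures only care about the ordering of the values, one wild observation can shift ranks by at most one position each, so the estimate degrades gracefully.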
If you really want to do this (remove the largest (absolute) residuals), then we can employ the linear model to estimate the least squares solution and associated residuals and then select the middle n% of the data. Here is an example:
Firstly, generate some dummy data:
require(MASS) ## for mvrnorm()
set.seed(1)
dat <- mvrnorm(1000, mu = c(4,5), Sigma = matrix(c(1,0.8,1,0.8), ncol = 2))
dat <- data.frame(dat)
names(dat) <- c("X","Y")
plot(dat)
Next, we fit the linear model and extract the residuals:
res <- resid(mod <- lm(Y ~ X, data = dat))
The quantile() function can give us the required quantiles of the residuals. You suggested retaining 90% of the data, so we want the 0.05 and 0.95 quantiles:
res.qt <- quantile(res, probs = c(0.05,0.95))
Select those observations with residuals in the middle 90% of the data:
want <- which(res >= res.qt[1] & res <= res.qt[2])
We can then visualise this, with the red points being those we will retain:
plot(dat, type = "n")
points(dat[-want,], col = "black", pch = 21, bg = "black", cex = 0.8)
points(dat[want,], col = "red", pch = 21, bg = "red", cex = 0.8)
abline(mod, col = "blue", lwd = 2)
The correlations for the full data and the selected subset are:
> cor(dat)
X Y
X 1.0000000 0.8935235
Y 0.8935235 1.0000000
> cor(dat[want,])
X Y
X 1.0000000 0.9272109
Y 0.9272109 1.0000000
> cor(dat[-want,])
X Y
X 1.000000 0.739972
Y 0.739972 1.000000
Be aware that here we might be throwing out perfectly good data, because we simply discard the 5% of observations with the largest positive residuals and the 5% with the largest negative residuals. An alternative is to select the 90% with the smallest absolute residuals:
ares <- abs(res)
absres.qt <- quantile(ares, probs = 0.9)
abswant <- which(ares <= absres.qt)
## plot - virtually the same, but not quite
plot(dat, type = "n")
points(dat[-abswant,], col = "black", pch = 21, bg = "black", cex = 0.8)
points(dat[abswant,], col = "red", pch = 21, bg = "red", cex = 0.8)
abline(mod, col = "blue", lwd = 2)
With this slightly different subset, the correlation is slightly lower:
> cor(dat[abswant,])
X Y
X 1.0000000 0.9272032
Y 0.9272032 1.0000000
Another point is that even then we are throwing out good data. You might want to look at Cook's distance as a measure of how influential each observation is, and discard only those values above a certain threshold of Cook's distance. Wikipedia has information on Cook's distance and proposed thresholds. The cooks.distance() function can be used to retrieve the values from mod:
> head(cooks.distance(mod))
1 2 3 4 5 6
7.738789e-04 6.056810e-04 6.375505e-04 4.338566e-04 1.163721e-05 1.740565e-03
If you compute the thresholds suggested on Wikipedia, you can discard only those observations that exceed them. For these data, none of the Cook's distances exceed the conservative cutoff of 1:
> any(cooks.distance(mod) > 1)
[1] FALSE
(not surprising, given the way I generated the data). The stricter rule-of-thumb cutoff of 4/n can be checked the same way, with any(cooks.distance(mod) > (4 / nrow(dat))).
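As a sketch of putting Cook's distance into practice, the following self-contained example simulates similar bivariate data (using plain rnorm() rather than mvrnorm(), so the numbers differ from the output above) and drops only the observations above the common 4/n rule-of-thumb cutoff:

```r
set.seed(1)
X <- rnorm(1000, mean = 4)                        # predictor
Y <- 5 + 0.8 * (X - 4) + rnorm(1000, sd = 0.5)    # response; population cor ~ 0.85
dat <- data.frame(X, Y)

mod <- lm(Y ~ X, data = dat)
cd  <- cooks.distance(mod)

## retain only observations below the 4/n rule-of-thumb cutoff
keep <- cd < 4 / nrow(dat)
c(full = cor(dat$X, dat$Y), trimmed = cor(dat$X[keep], dat$Y[keep]))
```

Unlike trimming a fixed 10%, this discards only the points the diagnostic actually flags, which may be far fewer.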
Having said all of this, why do you want to do this? If you are just trying to get rid of data to improve a correlation or to generate a significant relationship, that sounds a bit fishy, and a bit like data dredging, to me.