
How do I tell R to remove the outlier from a correlation calculation?

How do I tell R to remove an outlier when calculating correlation? I identified a potential outlier from a scatter plot, and am trying to compare correlation with and without this value. This is for an intro stats course; I am just playing with this data to start understanding correlation and outliers.

My data looks like this:

"Australia" 35.2 31794.13
"Austria" 29.1 33699.6
"Canada" 32.6 33375.5
"CzechRepublic" 25.4 20538.5
"Denmark" 24.7 33972.62
...

and so on, for 26 lines of data. I am trying to find the correlation of the first and second numbers.

I did read this question; however, I am only trying to remove a single point, not a percentage of points. Is there a command in R to do this?

Beth asked Oct 12 '12 03:10



4 Answers

You can't do that with the basic cor() function, but you can:

  • use a correlation function from one of the robust statistics packages, e.g. covRob() from package robust

  • use a winsorize() function, e.g. from robustHD, to treat your data
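
A rough sketch of the first option, for illustration only. It swaps in MASS::cov.rob() (a different robust estimator, chosen here only because MASS ships with standard R installations) rather than the robust package, and the data are simulated, so treat the names and settings as assumptions:

```r
library(MASS)  # recommended package bundled with standard R installations

set.seed(42)
x <- rnorm(100)
y <- rnorm(100)
x[50] <- y[50] <- 10               # plant one large outlier in both series

r_plain <- cor(x, y)               # ordinary Pearson correlation, pulled up by the outlier

# Robust covariance/correlation estimate (minimum covariance determinant).
fit <- cov.rob(cbind(x, y), cor = TRUE, method = "mcd")
r_robust <- fit$cor[1, 2]          # off-diagonal entry is the robust correlation

r_plain
r_robust
```

The robust estimate down-weights the planted point instead of deleting it, so no index has to be picked by hand.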

Here is a quick example for the 2nd approach:

R> set.seed(42)
R> x <- rnorm(100)
R> y <- rnorm(100)
R> cor(x,y)             # correlation of two unrelated series: almost zero
[1] 0.0312798

Then we "contaminate" one point each with a big outlier:

R> x[50] <- y[50] <- 10
R> cor(x,y)             # bigger correlation due to one bad data point
[1] 0.534996

So let's winsorize:

R> x <- robustHD::winsorize(x)
R> y <- robustHD::winsorize(y)
R> cor(x,y)
[1] 0.106519
R> 

and we're back down to a less correlated measure.
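
For comparison, when the position of the bad point is already known (index 50 in this simulated case), plain negative indexing is enough to leave that one pair out before calling cor(). A minimal sketch that re-creates the data so it runs on its own:

```r
set.seed(42)
x <- rnorm(100)
y <- rnorm(100)
x[50] <- y[50] <- 10                    # same single contaminated pair

r_contaminated <- cor(x, y)             # includes the outlier
r_dropped      <- cor(x[-50], y[-50])   # negative index drops element 50 from both vectors

r_contaminated
r_dropped
```

The same index must be dropped from both vectors, otherwise the remaining pairs no longer line up.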

Dirk Eddelbuettel answered Oct 28 '22 02:10


If you apply the same logical condition to both vectors, you can exclude that "point" from both:

keep <- DF[[2]] > 100    # TRUE for the rows to keep; adjust the condition so it isolates your outlier
cor(DF[[2]][keep],       # 2nd column, filtered on its own values
    DF[[3]][keep])       # 3rd column, filtered on the same 2nd-column condition
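
Applied to the five rows shown in the question, this might look like the sketch below. The column names are made up, and the question does not say which country is the outlier, so the flagged row is an arbitrary stand-in:

```r
# Only the five rows quoted in the question; column names are hypothetical.
DF <- data.frame(
  country = c("Australia", "Austria", "Canada", "CzechRepublic", "Denmark"),
  v1      = c(35.2, 29.1, 32.6, 25.4, 24.7),
  v2      = c(31794.13, 33699.6, 33375.5, 20538.5, 33972.62)
)

suspect <- "CzechRepublic"        # arbitrary choice, for illustration only
keep    <- DF$country != suspect  # one logical vector, reused for both columns

r_all     <- cor(DF$v1, DF$v2)              # with every row
r_without <- cor(DF$v1[keep], DF$v2[keep])  # with the suspect row left out

r_all
r_without
```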

IRTFM answered Oct 28 '22 01:10


In the following, I am presuming (reading between the lines) that you identified that single outlier visually (i.e., from a graph). With your limited data set, it's probably easy to identify that point based on its value. If you have more data points, you could use something like this.

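tmp <- qqnorm(bi$bias.index)                   # qqnorm() returns the plotted x/y coordinates
qqline(bi$bias.index)
(X <- identify(tmp, labels = rownames(bi)))    # click the outlier(s); X holds their indices
qqnorm(bi$bias.index[-X])                      # redraw without the identified points
qqline(bi$bias.index[-X])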

Note that I just copied my own code because you didn't provide sample code to work from. Also check ?identify before using it.
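
Adapted to a two-variable scatter plot, the same idea could look roughly like this. The data are made up, and the interactive() guard plus the empty-selection check are additions for the sketch, not part of the code above:

```r
# Made-up data: five ordinary points plus one obvious outlier.
x <- c(1, 2, 3, 4, 5, 20)
y <- c(2, 1, 4, 3, 5, 20)

plot(x, y)

# identify() waits for mouse clicks, so only call it in an interactive session.
picked <- if (interactive()) identify(x, y, labels = seq_along(x)) else integer(0)

# x[-integer(0)] returns an empty vector, so handle "nothing clicked" explicitly.
keep <- if (length(picked) > 0) -picked else seq_along(x)

cor(x, y)              # all points
cor(x[keep], y[keep])  # clicked points left out
```

If no point is clicked, both correlations are the same because nothing is dropped.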

Paul Lemmens answered Oct 28 '22 01:10


It makes sense to put all your data in a data frame, so it's easier to handle. I always like to keep track of outliers by using an extra column (in this case, B) in my data frame.

df       <-  data.frame(A = c(1, 2, 3, 4, 5), B = c(TRUE, TRUE, TRUE, FALSE, TRUE))  # B is FALSE for the outlier

And then filter out data I don't want before getting into the good analytical stuff.

myFilter <-  with(df, B)    # B is already logical, so no comparison with TRUE is needed
df[myFilter, ]              # only the rows flagged as usable

This way, you don't lose track of the outliers, and you are able to manage them as you see fit.
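
For the correlation in the question, a flag column can be used along these lines. This is a small sketch with invented numbers and a hypothetical column name is_outlier:

```r
# Invented data: five ordinary pairs plus one extreme pair in the last row.
df <- data.frame(
  x = c(1, 2, 3, 4, 5, 20),
  y = c(2, 1, 4, 3, 5, 20)
)
df$is_outlier <- c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)  # flag set by hand

r_all  <- with(df, cor(x, y))                       # every row
r_kept <- with(subset(df, !is_outlier), cor(x, y))  # flagged row excluded; equals 0.8 here

r_all
r_kept
```

The flagged row stays in df, so the comparison can be repeated or the flag changed later without rebuilding the data.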

EDIT:

Improving upon my answer above, you could also use conditionals to define the outliers.

df   <-  data.frame(A = c(1, 2, 15, 1, 2))
df$B <-  with(df, A > 2)    # here TRUE marks an outlier
subset(df, !B)              # keep only the non-outliers

JAponte answered Oct 28 '22 02:10