How do I tell R to remove an outlier when calculating correlation? I identified a potential outlier from a scatter plot, and am trying to compare correlation with and without this value. This is for an intro stats course; I am just playing with this data to start understanding correlation and outliers.
My data looks like this:
"Australia" 35.2 31794.13
"Austria" 29.1 33699.6
"Canada" 32.6 33375.5
"CzechRepublic" 25.4 20538.5
"Denmark" 24.7 33972.62
...
and so on, for 26 lines of data. I am trying to find the correlation of the first and second numbers.
I did read this question, however, I am only trying to remove a single point, not a percentage of points. Is there a command in R to do this?
You can't do that with the basic cor()
function but you can
use a correlation function from one of the robust statistics packages, eg robCov()
from package robust
use a winsorize()
function, eg from robustHD, to treat your data
Here is a quick example for the 2nd approach:
R> set.seed(42)
R> x <- rnorm(100)
R> y <- rnorm(100)
R> cor(x,y) # correlation of two unrelated series: almost zero
[1] 0.0312798
The we "contaminate" one point each with a big outlier:
R> x[50] <- y[50] <- 10
R> cor(x,y) # bigger correlation due to one bad data point
[1] 0.534996
So let's winsorize:
R> x <- robustHD::winsorize(x)
R> y <- robustHD::winsorize(y)
R> cor(x,y)
[1] 0.106519
R>
and we're back down to a less correlated measure.
If you apply the same conditional expression for both vectors you could exclude that "point".
cor( DF[2][ DF[2] > 100 ], # items in 2nd column excluded based on their values
DF[3][ DF[2] > 100 ] ) # items in 3rd col excluded based on the 2nd col values
In the following, I worked from the presumption (that I read between your lines) that you have identified that single outlier visually (ie., from a graph). From your limited data set it's probably easy to identify that point based on its value. If you have more data points, you could use something like this.
tmp <- qqnorm(bi$bias.index)
qqline(bi$bias.index)
(X <- identify(tmp, , labels=rownames(bi)))
qqnorm(bi$bias.index[-X])
qqline(bi$bias.index[-X])
Note that I just copied my own code because I couldn't work from sample code from you. Also check ?identify
before.
It makes sense to put all your data on a data frame, so it's easier to handle. I always like to keep track of outliers by using an extra column (in this case, B) in my data frame.
df <- data.frame(A=c(1,2,3,4,5), B=c(T,T,T,F,T))
And then filter out data I don't want before getting into the good analytical stuff.
myFilter <- with(df, B==T)
df[myFilter, ]
This way, you don't lose track of the outliers, and you are able to manage them as you see fit.
EDIT:
Improving upon my answer above, you could also use conditionals to define the outliers.
df <- data.frame(A=c(1,2,15,1,2))
df$B<- with(df, A > 2)
subset(df, B == F)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With