Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove values outside Loess curve limits

I am looking to remove outliers before I apply a model. I am using a Loess curve to delimit the trend line and have set outlier limits. I would like to remove the rows that are outside the defined limits. Besides doing this with a custom function that takes each point one at a time and checks the local Loess slope etc... is there an easier way?

Loess trend line with limits (1.2)

# Code generating image above
scatter.smooth( idam$T_d, idam$T_x10d)
loessline <- loess.smooth( idam$T_d, idam$T_x10d)
lines(loessline$x, loessline$y, lwd=3)
lines(loessline$x, loessline$y*1.2, lwd=3, col='red')
lines(loessline$x, loessline$y/1.2, lwd=3, col='red')
like image 779
Cyrille Avatar asked Nov 07 '14 12:11

Cyrille


1 Answers

Detecting outliers could be done with help of DBSCAN R package, well-known algorithm used for cluster identification (see WIKIPEDIA for more details ).

This function has three important inputs:

  • x: your data (only numeric value)
  • eps: the target maximal distance
  • minPts: number of minimal points to consider them as a cluster

Evaluating eps can be done with help of knndist(...) and knndistplot(...) functions:

  • knndistplot will plot the eps values on your data set for a given k (i.e. minPts) ==> you could visually select an effective eps value (generally in the knee curve part)
  • knndist will evaluate eps values and return them in matrix from. the k input will generate 1:1:k valuations and you could use the result for determining programatically the accurate eps & k values

Next you have just to use dbscan(yourdata, eps, k) for getting a dbscan object with following components:

  • eps: the eps used for calculations
  • minPts: the minimal number of points for identifying a cluster
  • cluster: an integer vector identifying points which belongs to a cluster (=1) or not (=0). The last ones correspond to the outliers you want to eliminate.

Please note the following limitations on dbscan:

  • dbscan use euclidean distance and so it is submitted to "Curse of Dimensions". This can be avoided with use of PCA
  • dbscan eliminates superposed points which might generate unidentified points. This can be solved either by merging the results with your data with a left outer join or by using jitter(...) function which adds noise to your data. According to the data you displayed, I think it could be the case for your data

knowing this limitations, dbscan package offers two alternative methods : LOF and OPTICS (an extension of DBSCAN)

Edit on 25 Jabuary 2016

Following @rawr answer, I'm giving an example based on mtcars dataset to show how to use dbscan for identifying outliers. Please note that my example will use the excellent data.table package instead of the classical data.frame.

First, I start to replicate rawr's approach for illustrating the use of data.table

require(data.table)
require(ggplot2)
require(dbscan)
data(mtcars)
dt_mtcars <- as.data.table(mtcars)

# based on rawr's approach
plot(wt~mpg, data=dt_mtcars)
lo <- loess.smooth(dt_mtcars[,mpg], dt_mtcars[,wt])
lines(lo$x,lo$y, lwd=3)
lines(lo$x,lo$y * 1.2, lwd=3 , col=2 )
lines(lo$x,lo$y / 1.2, lwd=3 , col=2 )

enter image description here

Thereby, we can assess that we get the same results independently of underlying support.

Second, the following code illustrate the DBSCAN approach which starts with determining eps and k, the necessary number of points to identifying a cluster:

res_knn = kNNdist( dt_mtcars[, .(wt, mpg)] , k = 10)
dim_knn = dim(res_knn)
x_knn =  seq(1, dim_knn[1])
ggplot() + 
   geom_line( aes( x = x_knn , y = sort(res_knn[, 1])  , col = 1 ) ) + 
   geom_line( aes( x = x_knn , y = sort(res_knn[, 2])  , col = 2 ) ) + 
   geom_line( aes( x = x_knn , y = sort(res_knn[, 3])  , col = 3 ) ) + 
   geom_line( aes( x = x_knn , y = sort(res_knn[, 4])  , col = 4 ) ) + 
   geom_line( aes( x = x_knn , y = sort(res_knn[, 5])  , col = 5 ) ) + 
   geom_line( aes( x = x_knn , y = sort(res_knn[, 6])  , col = 6 ) ) + 
   geom_line( aes( x = x_knn , y = sort(res_knn[, 7])  , col = 7 ) ) + 
   geom_line( aes( x = x_knn , y = sort(res_knn[, 8])  , col = 8 ) ) + 
   geom_line( aes( x = x_knn , y = sort(res_knn[, 9])  , col = 9 ) ) + 
   geom_line( aes( x = x_knn , y = sort(res_knn[, 10]) , col = 10 ) )  +
   xlab('sorted results') + 
   ylab('kNN distance')

The results are plotted in following graphic:

enter image description here

It shows that the calculated kNN distance is sensitive to the factor k however the accurate eps value for segregating outliers is located in the knee part of the curves ==> suitable eps is located between 2 and 4. This is a visual assessment which could be automatized with appropriate search algorithms (for instance, see this link). Regarding k, a tradeoff must be defined, knowing the lower k is, the less stringent results are.

In next part, we will parametrize dbscan with eps = 3 (based on visual estimation) and k = 4 for having slight stringent results. We will plot these results with help of rawr's code:

eps = 3
k = 4
res_dbscan = dbscan( dt_mtcars[, .(wt, mpg)] , eps , k )
plot(wt~mpg, data=dt_mtcars, col = res_dbscan$cluster)
lo <- loess.smooth(dt_mtcars[res_dbscan$cluster>0,mpg], dt_mtcars[res_dbscan$cluster>0,wt])
lines(lo$x,lo$y, lwd=3)
lines(lo$x,lo$y * 1.2, lwd=3 , col=2 )
lines(lo$x,lo$y / 1.2, lwd=3 , col=2 )

enter image description here

we got this figure where we can assess that we got different results from rawr's approach, where the points located in mpg = [10,13] are considered as outliers.

These results could considered as odd comparing to rawr's solution, which works under the assumptions of having bivariate data (Y ~ X). However mtcars is a multidimensional dataset where relationship between variables might be (or not) linear... To assess this point, we could scatterplot this dataset, filtered on numerical values for instance

pairs(dt_mtcars[, .(mpg, disp, hp, drat, wt, qsec)])

enter image description here

If we focus only on the result wt ~ mpg, we might think it is an anti-linear relationship at first look. But with other plotted relationships, this might be not the case and finding outliers in N-Dim environment is a bit trickier. Indeed one point might be considered as outlier when projected in particular 2D comparison... but inversely if we add a new comparison dimension. Indeed we might have collinearity which could be identified and so strengthen the cluster relationship or not.

My friends, I agree it's a lot of if and in order to illustrate this situation, we will proceed to a dbscan analysis on numeric values of mtcars.

So I will replicate the process presented earlier and let's start with the kNN distance analysis:

res_knn = kNNdist( dt_mtcars[, .(mpg, disp, hp, drat, wt, qsec)] , k = 10)
dim_knn = dim(res_knn)
x_knn =  seq(1, dim_knn[1])
ggplot() + 
    geom_line( aes( x = x_knn , y = sort(res_knn[, 1])  , col = 1 ) ) + 
    geom_line( aes( x = x_knn , y = sort(res_knn[, 2])  , col = 2 ) ) + 
    geom_line( aes( x = x_knn , y = sort(res_knn[, 3])  , col = 3 ) ) + 
    geom_line( aes( x = x_knn , y = sort(res_knn[, 4])  , col = 4 ) ) + 
    geom_line( aes( x = x_knn , y = sort(res_knn[, 5])  , col = 5 ) ) + 
    geom_line( aes( x = x_knn , y = sort(res_knn[, 6])  , col = 6 ) ) + 
    geom_line( aes( x = x_knn , y = sort(res_knn[, 7])  , col = 7 ) ) + 
    geom_line( aes( x = x_knn , y = sort(res_knn[, 8])  , col = 8 ) ) + 
    geom_line( aes( x = x_knn , y = sort(res_knn[, 9])  , col = 9 ) ) + 
    geom_line( aes( x = x_knn , y = sort(res_knn[, 10]) , col = 10 ) )  +
    xlab('sorted results') + 
    ylab('kNN distance')

sorted kNN distances

Comparatively to the analysis produced on wt ~ mpg, we can see that kNNdist(...) produce far more important kNN distance (until 200 with k = 10 for instance). However we still have the knee part which help us to estimate a suitable eps value.

In next part, we will use eps = 75 and k = 5 and

# optimal eps value is between 40 (k=1) and 130 (k=10)
eps = 75
k = 5
res_dbscan = dbscan( dt_mtcars[, .(mpg, disp, hp, drat, wt, qsec)] , eps , k )
pairs(dt_mtcars[, .(mpg, disp, hp, drat, wt, qsec)] , col = res_dbscan$cluster+2L)

enter image description here

Thereby the scatterplot of this analysis highlights that identifying outliers could be tricky in N-Dim environment, due to complex relationships between variables. But please note that in most of cases, the outliers are located in a corner part of the 2D projection, which strengthen the results obtained with wt ~ mpg

like image 104
Bruno Sarrant Avatar answered Oct 13 '22 00:10

Bruno Sarrant