I have been searching everywhere for the best method to identify multivariate outliers in R, but I don't think I have found a convincing approach yet.
We can take the iris data as an example, since my data also contains multiple numeric fields:
data(iris)
df <- iris[, 1:4] # only the four numeric columns
First, I am using the Mahalanobis distance from the MVN package:
library(MVN)
result <- mvOutlier(df, qqplot = TRUE, method = "quan")     # non-adjusted
result <- mvOutlier(df, qqplot = TRUE, method = "adj.quan") # adjusted Mahalanobis distance
Both flag a large number of outliers (50 of 150 for non-adjusted and 49 of 150 for adjusted), which seems far too many. Unfortunately, I can't find a parameter in mvOutlier to adjust the threshold (that is, to require a higher probability before a point is declared an outlier, so that fewer points are flagged).
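If mvOutlier's fixed cutoff is too permissive, one workaround is to compute the Mahalanobis distances yourself with base R and pick the chi-square quantile explicitly. This is only a sketch: the 0.999 quantile below is an illustrative choice, not a recommendation, and mvOutlier may use robust estimates of the centre and covariance, so its counts will not match this classical version exactly.

```r
# Mahalanobis distances with an explicit, adjustable cutoff (base R only).
data(iris)
df <- iris[, 1:4]

# Squared Mahalanobis distance of every row from the column means.
md <- mahalanobis(df, center = colMeans(df), cov = cov(df))

# Under multivariate normality, md is approximately chi-square distributed
# with p = ncol(df) degrees of freedom, so the cutoff is just a chi-square
# quantile; raising the probability flags fewer points.
cutoff <- qchisq(0.999, df = ncol(df))  # 0.999 is an illustrative choice

outliers <- which(md > cutoff)
```

Lowering the quantile back towards 0.975 or 0.95 flags progressively more points, which gives you the tuning knob that mvOutlier does not expose.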
Second, I used the outliers package. It detects univariate outliers, so my plan was to find the outliers along each dimension separately and treat the points that are outliers on all dimensions as outliers of the whole dataset.
library(outliers)
result <- scores(df, type = "t", prob = 0.95) # t-scores, flag points beyond probability 0.95
result <- subset(result, Sepal.Length & Sepal.Width & Petal.Length & Petal.Width)
Here we can set the probability, but I don't think this can replace genuine multivariate outlier detection.
Some other approaches that I tried:
Multivariate outliers can be identified with the use of Mahalanobis distance, which is the distance of a data point from the calculated centroid of the other cases where the centroid is calculated as the intersection of the mean of the variables being assessed.
In order to detect multivariate outliers, most psychologists compute the Mahalanobis distance (Mahalanobis, 1930; see also Leys et al.).
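As a concrete illustration of that definition, the squared Mahalanobis distance can be computed by hand from the centroid and the inverse covariance matrix, and checked against base R's mahalanobis():

```r
# The squared Mahalanobis distance of each observation from the centroid
# (the vector of column means), computed by hand and verified against
# base R's mahalanobis().
data(iris)
X <- as.matrix(iris[, 1:4])

centroid <- colMeans(X)  # intersection of the means of the variables
S_inv <- solve(cov(X))   # inverse covariance matrix

# d^2_i = (x_i - mu)' S^{-1} (x_i - mu)
d2 <- apply(X, 1, function(x) t(x - centroid) %*% S_inv %*% (x - centroid))

all.equal(as.numeric(d2), as.numeric(mahalanobis(X, centroid, cov(X))))  # TRUE
```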
The two main types of outlier detection methods are:

- distance- and density-based methods, which flag points that lie far from, or in sparse regions of, the rest of the data; and
- model-based methods, which fit a model of the data distribution and flag points that fall outside a user-defined threshold.
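The distance/density idea can be sketched in a few lines of base R: score each point by its mean distance to its k nearest neighbours and inspect the highest-scoring points. The choices of k = 10 and of reporting the top 5 points are arbitrary here, purely for illustration.

```r
# Rough distance-based outlier scoring: mean distance to the k nearest
# neighbours (larger score = more isolated point).
data(iris)
X <- as.matrix(iris[, 1:4])
k <- 10  # illustrative choice

D <- as.matrix(dist(X))  # pairwise Euclidean distances, 150 x 150

# For each row, average the k smallest distances, skipping the
# zero distance of the point to itself.
knn_score <- apply(D, 1, function(row) mean(sort(row)[2:(k + 1)]))

# Indices of the 5 most isolated points.
head(order(knn_score, decreasing = TRUE), 5)
```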
I'll leave you with two links: the first is a paper on different methods for multivariate outlier detection, and the second shows how to implement some of them in R.
Cook's distance is a valid way of measuring the influence a data point has and, as such, helps detect outlying points. Mahalanobis distance is also used regularly.
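For completeness, here is a minimal sketch of Cook's distance. It requires a regression model, so the formula below (regressing Sepal.Length on the other numeric variables) is purely illustrative, as is the common 4/n rule of thumb for the cutoff:

```r
# Cook's distance measures how much each observation influences the
# fitted regression; large values suggest influential (possibly outlying)
# points.
data(iris)
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
          data = iris)

cd <- cooks.distance(fit)

# A common rule of thumb flags points with Cook's distance > 4/n.
influential <- which(cd > 4 / nrow(iris))
```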
For your test example, the iris dataset is not a good choice. It is used for classification problems precisely because the classes are clearly separable; excluding 50 data points would throw away an entire species.
Outlier Detection in Multivariate Data:
http://www.m-hikari.com/ams/ams-2015/ams-45-48-2015/13manojAMS45-48-2015-96.pdf
R implementation
http://r-statistics.co/Outlier-Treatment-With-R.html