I have been searching everywhere for the best method to identify multivariate outliers in R, but I don't think I have found a convincing approach yet.
We can take the iris data as an example, since my data also contains multiple numeric fields:
data(iris)
df <- iris[, 1:4] # only the four numeric columns
First, I am using the Mahalanobis distance from the MVN package:
library(MVN)
result <- mvOutlier(df, qqplot = TRUE, method = "quan")     # non-adjusted
result <- mvOutlier(df, qqplot = TRUE, method = "adj.quan") # adjusted Mahalanobis distance
Both flag a large number of outliers (50 of 150 for non-adjusted and 49 of 150 for adjusted), which seems far too many. Unfortunately, I can't find a parameter in mvOutlier to adjust the threshold (that is, to require a higher probability before a point is declared an outlier, so that fewer points are flagged).
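If mvOutlier's fixed cutoff is too permissive, one workaround is to compute the Mahalanobis distances yourself with base R and pick the chi-square quantile explicitly. This is only a sketch: the 0.999 quantile below is an illustrative choice, not a recommendation, and mvOutlier may use robust estimates of the centre and covariance, so its counts will not match this classical version exactly.

```r
# Mahalanobis distances with an explicit, adjustable cutoff (base R only).
data(iris)
df <- iris[, 1:4]

# Squared Mahalanobis distance of every row from the column means.
md <- mahalanobis(df, center = colMeans(df), cov = cov(df))

# Under multivariate normality, md is approximately chi-square distributed
# with p = ncol(df) degrees of freedom, so the cutoff is just a chi-square
# quantile; raising the probability flags fewer points.
cutoff <- qchisq(0.999, df = ncol(df))  # 0.999 is an illustrative choice

outliers <- which(md > cutoff)
```

Lowering the quantile back towards 0.975 or 0.95 flags progressively more points, which gives you the tuning knob that mvOutlier does not expose.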
Second, I used the outliers package. It detects univariate outliers, so my plan was to find the outliers along each dimension separately and treat the points that are outliers on all dimensions as outliers of the whole dataset.
library(outliers)
result <- scores(df, type = "t", prob = 0.95) # t-scores, flag points beyond probability 0.95
result <- subset(result, Sepal.Length & Sepal.Width & Petal.Length & Petal.Width)
Here we can set the probability, but I don't think this can replace genuine multivariate outlier detection.
Some other approaches that I tried:
Multivariate outliers can be identified with the use of Mahalanobis distance, which is the distance of a data point from the calculated centroid of the other cases where the centroid is calculated as the intersection of the mean of the variables being assessed.
In order to detect multivariate outliers, most psychologists compute the Mahalanobis distance (Mahalanobis, 1930; see also Leys et al.).
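As a concrete illustration of that definition, the squared Mahalanobis distance can be computed by hand from the centroid and the inverse covariance matrix, and checked against base R's mahalanobis():

```r
# The squared Mahalanobis distance of each observation from the centroid
# (the vector of column means), computed by hand and verified against
# base R's mahalanobis().
data(iris)
X <- as.matrix(iris[, 1:4])

centroid <- colMeans(X)  # intersection of the means of the variables
S_inv <- solve(cov(X))   # inverse covariance matrix

# d^2_i = (x_i - mu)' S^{-1} (x_i - mu)
d2 <- apply(X, 1, function(x) t(x - centroid) %*% S_inv %*% (x - centroid))

all.equal(as.numeric(d2), as.numeric(mahalanobis(X, centroid, cov(X))))  # TRUE
```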
The two main types of outlier detection methods are:

- distance- and density-based methods, which flag points that lie far from, or in sparse regions of, the rest of the data; and
- model-based methods, which fit a model of the data distribution and flag points that fall outside a user-defined threshold.
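The distance/density idea can be sketched in a few lines of base R: score each point by its mean distance to its k nearest neighbours and inspect the highest-scoring points. The choices of k = 10 and of reporting the top 5 points are arbitrary here, purely for illustration.

```r
# Rough distance-based outlier scoring: mean distance to the k nearest
# neighbours (larger score = more isolated point).
data(iris)
X <- as.matrix(iris[, 1:4])
k <- 10  # illustrative choice

D <- as.matrix(dist(X))  # pairwise Euclidean distances, 150 x 150

# For each row, average the k smallest distances, skipping the
# zero distance of the point to itself.
knn_score <- apply(D, 1, function(row) mean(sort(row)[2:(k + 1)]))

# Indices of the 5 most isolated points.
head(order(knn_score, decreasing = TRUE), 5)
```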
I'll leave you with two links: the first is a paper on different methods for multivariate outlier detection, and the second shows how to implement some of them in R.
Cook's distance is a valid way of measuring the influence a data point has and, as such, helps detect outlying points. Mahalanobis distance is also used regularly.
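For completeness, here is a minimal sketch of Cook's distance. It requires a regression model, so the formula below (regressing Sepal.Length on the other numeric variables) is purely illustrative, as is the common 4/n rule of thumb for the cutoff:

```r
# Cook's distance measures how much each observation influences the
# fitted regression; large values suggest influential (possibly outlying)
# points.
data(iris)
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
          data = iris)

cd <- cooks.distance(fit)

# A common rule of thumb flags points with Cook's distance > 4/n.
influential <- which(cd > 4 / nrow(iris))
```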
For your test example, the iris dataset is not a good choice. It is used for classification problems precisely because the classes are clearly separable; excluding 50 data points would throw away an entire species.
Outlier Detection in Multivariate Data:
http://www.m-hikari.com/ams/ams-2015/ams-45-48-2015/13manojAMS45-48-2015-96.pdf
R implementation
http://r-statistics.co/Outlier-Treatment-With-R.html