What are the efficient and accurate algorithms to exclude outliers from a set of data?

I have a set of 200 data rows (so a small data set). I want to carry out some statistical analysis, but before that I want to exclude outliers.

What are suitable algorithms for this purpose? Accuracy is a major concern.

I am very new to statistics, so I need help with very basic algorithms.

asked Jan 15 '10 by Ashish Agarwal

People also ask

Which is the best method for removing outliers in a data set?

When you decide to remove outliers, document the excluded data points and explain your reasoning. You must be able to attribute a specific cause for removing outliers. Another approach is to perform the analysis with and without these observations and discuss the differences.

Which algorithm can handle outliers?

In this article, we have seen 3 different methods for dealing with outliers: the univariate method, the multivariate method, and the Minkowski error. These methods are complementary, and we might need to try them all if our data set has many severe outliers.

Which of the following measure can handle outliers efficiently?

Fitting data with Least Absolute Deviations (the L1-norm method) is much more robust to outliers than methods based on Least Squares.
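As a minimal illustration of that point (not part of the original page), the sketch below fits a line by ordinary least squares and by least absolute deviations on data containing a single severe outlier. The synthetic data and the use of scipy.optimize.minimize are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)
y[25] += 40.0                       # inject one severe outlier

# Ordinary least squares (L2): minimizes the sum of squared residuals.
slope_l2, intercept_l2 = np.polyfit(x, y, deg=1)

# Least absolute deviations (L1): minimizes the sum of |residuals|.
def l1_loss(params):
    slope, intercept = params
    return np.abs(y - (slope * x + intercept)).sum()

slope_l1, intercept_l1 = minimize(l1_loss, x0=[1.0, 0.0], method="Nelder-Mead").x

print(f"L2 fit: slope={slope_l2:.2f}, intercept={intercept_l2:.2f}")  # pulled toward the outlier
print(f"L1 fit: slope={slope_l1:.2f}, intercept={intercept_l1:.2f}")  # close to the true 2.0 and 1.0
```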


2 Answers

Overall, the thing that makes a question like this hard is that there is no rigorous definition of an outlier. I would actually recommend against using a certain number of standard deviations as the cutoff for the following reasons:

  1. A few outliers can have a huge impact on your estimate of standard deviation, as standard deviation is not a robust statistic.
  2. The interpretation of standard deviation depends hugely on the distribution of your data. If your data is normally distributed then 3 standard deviations is a lot, but if it's, for example, log-normally distributed, then 3 standard deviations is not a lot.

There are a few good ways to proceed:

  1. Keep all the data, and just use robust statistics (median instead of mean, Wilcoxon test instead of T-test, etc.). Probably good if your dataset is large.

  2. Trim or Winsorize your data. Trimming means removing the top and bottom x%. Winsorizing means capping the bottom x% at the xth percentile and the top x% at the (100 - x)th percentile (see the sketch after this list).

  3. If you have a small dataset, you could just plot your data and examine it manually for implausible values.

  4. If your data looks reasonably close to normally distributed (no heavy tails and roughly symmetric), then use the median absolute deviation instead of the standard deviation as your test statistic and filter to 3 or 4 median absolute deviations away from the median.
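Here is a minimal sketch of options 2 and 4 above (not part of the original answer), assuming NumPy and a synthetic ~200-row dataset; the 5% clip fraction and the 3-MAD cutoff are example choices, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(50, 5, 195), [120, 130, -40, 150, 200]])  # ~200 rows, a few outliers

# Option 2: winsorize by clipping to the 5th and 95th percentiles.
lo, hi = np.percentile(data, [5, 95])
winsorized = np.clip(data, lo, hi)

# Option 4: keep only points within 3 median absolute deviations of the median.
median = np.median(data)
mad = np.median(np.abs(data - median))
filtered = data[np.abs(data - median) <= 3 * mad]

print(f"original:     n={data.size}, mean={data.mean():.1f}")
print(f"winsorized:   n={winsorized.size}, mean={winsorized.mean():.1f}")
print(f"MAD-filtered: n={filtered.size}, mean={filtered.mean():.1f}")
```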

answered Nov 27 '22 by dsimcha

Start by plotting the leverage of the outliers and then go for some good ol' interocular trauma (aka look at the scatterplot).

Lots of statistical packages have outlier/residual diagnostics, but I prefer Cook's D. You can calculate it by hand if you'd like, using the formula from mtsu.edu (the original link is dead; it is archived on archive.org).
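A minimal sketch (not part of the original answer) of getting Cook's D from statsmodels rather than by hand; the synthetic regression data and the 4/n flagging threshold are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=1.0, size=200)
y[10] += 15.0                                  # one high-influence point

X = sm.add_constant(x)                         # add intercept column
results = sm.OLS(y, X).fit()

# By hand this would be D_i = (resid_i**2 / (p * MSE)) * h_ii / (1 - h_ii)**2,
# with p model parameters and leverage h_ii; statsmodels computes it directly.
cooks_d, _ = results.get_influence().cooks_distance
flagged = np.where(cooks_d > 4 / len(y))[0]    # common rule-of-thumb cutoff
print("points flagged by Cook's D:", flagged)
```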

answered Nov 27 '22 by eric.a.booth