Is there a numpy builtin to reject outliers from a list?

How do you exclude outliers?

We can calculate the mean and standard deviation of a given sample, then calculate the cut-off for identifying outliers as more than 3 standard deviations from the mean. We can then identify outliers as those examples that fall outside of the defined lower and upper limits.
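
As a minimal sketch of that cut-off rule (the function name `reject_outliers_sigma`, the parameter `n_sigma` and the sample data are made up for illustration, not taken from any answer here):

```python
import numpy as np

def reject_outliers_sigma(data, n_sigma=3.0):
    """Keep values within n_sigma standard deviations of the mean."""
    data = np.asarray(data, dtype=float)
    mean, std = data.mean(), data.std()
    lower, upper = mean - n_sigma * std, mean + n_sigma * std
    return data[(data >= lower) & (data <= upper)]

# Hypothetical sample: 1000 Gaussian points plus one gross outlier
rng = np.random.default_rng(0)
sample = np.append(rng.normal(5, 2, 1000), 500.0)
cleaned = reject_outliers_sigma(sample)
print(len(sample), len(cleaned), cleaned.max())
```

Note that the outlier itself inflates the standard deviation, so this rule can miss outliers when there are many of them or when the sample is small.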

Something important when dealing with outliers is that one should try to use estimators as robust as possible. The mean of a distribution will be biased by outliers but e.g. the median will be much less.
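
A small illustration of that point, using made-up numbers: one extreme value drags the mean far away while the median does not move.

```python
import numpy as np

values = np.array([10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0])
with_outlier = np.append(values, 1000.0)

# Without the outlier both estimators agree
print(np.mean(values), np.median(values))              # 10.0 10.0
# With it, the mean is pulled up but the median stays put
print(np.mean(with_outlier), np.median(with_outlier))  # 133.75 10.0
```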

Building on eumiro's answer:

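import numpy as np

def reject_outliers(data, m=2.):
    # Absolute distance of each point from the median
    d = np.abs(data - np.median(data))
    # Median of those distances, used as a robust scale
    mdev = np.median(d)
    # Distances in units of mdev (scalar 0. if mdev is 0)
    s = d/mdev if mdev else 0.
    return data[s<m]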

Here I have replaced the mean with the more robust median, and the standard deviation with the median absolute distance to the median. I then scaled the distances by their (again) median value so that m is on a reasonable relative scale.

Note that for the data[s<m] syntax to work, data must be a numpy array.
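
A short usage sketch for a plain Python list, assuming the function above is reproduced as-is. The sample values are invented. Converting with `np.asarray` first avoids the `TypeError` that boolean-mask indexing raises on a list.

```python
import numpy as np

def reject_outliers(data, m=2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / mdev if mdev else 0.
    return data[s < m]

raw = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]   # plain list, hypothetical values

# Passing the list directly fails at the data[s < m] step
try:
    reject_outliers(raw)
    list_worked = True
except TypeError:
    list_worked = False

# Converting to an array first works
kept = reject_outliers(np.asarray(raw))
print(list_worked, kept)
```

Here the median is 3.5 and the median distance is 1.5, so only 100.0 has a scaled distance of 2 or more and is dropped.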

This method is almost identical to yours, just more idiomatic numpy (and it also works on numpy arrays only):

def reject_outliers(data, m=2):
    return data[abs(data - np.mean(data)) < m * np.std(data)]

Benjamin Bannier's answer yields a pass-through when the median of distances from the median is 0, so I found this modified version a bit more helpful for cases as given in the example below.

def reject_outliers_2(data, m=2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / (mdev if mdev else 1.)
    return data[s < m]

Example:

data_points = np.array([10, 10, 10, 17, 10, 10])
print(reject_outliers(data_points))
print(reject_outliers_2(data_points))

Gives:

[[10 10 10 17 10 10]]  # 17 is not filtered
[10 10 10 10 10]  # 17 is filtered (its distance, 7, is greater than m)
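
To see why the first version passes everything through (and wraps the result in an extra dimension), here is a small sketch of the two indexing cases on the same data. The variable names are only for illustration.

```python
import numpy as np

data_points = np.array([10, 10, 10, 17, 10, 10])
d = np.abs(data_points - np.median(data_points))
print(d)                      # [0. 0. 0. 7. 0. 0.]
print(np.median(d))           # 0.0, so mdev is falsy

# First version: s becomes the scalar 0., so `s < m` is the plain bool True.
# Indexing with True keeps every element and adds a leading axis.
scalar_mask = 0.0 < 2.0
passthrough = data_points[scalar_mask]
print(passthrough.shape)      # (1, 6)

# Second version: dividing by 1. keeps s as an array, so the mask is per element.
array_mask = d / 1.0 < 2.0
filtered = data_points[array_mask]
print(filtered)               # [10 10 10 10 10]
```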

Building on Benjamin's, using pandas.Series, and replacing MAD with IQR:

def reject_outliers(sr, iq_range=0.5):
    pcnt = (1 - iq_range) / 2
    qlow, median, qhigh = sr.dropna().quantile([pcnt, 0.50, 1-pcnt])
    iqr = qhigh - qlow
    return sr[ (sr - median).abs() <= iqr]

For instance, if you set iq_range=0.6, the quantiles bounding the range become 0.20 and 0.80. The resulting range is wider, so fewer points are rejected.
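
For readers without pandas, a NumPy-only sketch of the same idea follows. This is an adaptation rather than the code above: the name `reject_outliers_iqr` and the sample values are hypothetical, and `np.nanquantile` stands in for `dropna().quantile(...)`.

```python
import numpy as np

def reject_outliers_iqr(values, iq_range=0.5):
    """Keep points whose distance from the median is at most the quantile range."""
    values = np.asarray(values, dtype=float)
    pcnt = (1 - iq_range) / 2
    # nanquantile ignores NaNs when computing the quantiles
    qlow, median, qhigh = np.nanquantile(values, [pcnt, 0.5, 1 - pcnt])
    iqr = qhigh - qlow
    # Comparisons with NaN are False, so NaNs are dropped from the result
    return values[np.abs(values - median) <= iqr]

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100, np.nan]
kept = reject_outliers_iqr(sample)
print(kept)
```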

An alternative is to make a robust estimate of the standard deviation (assuming Gaussian statistics). Looking up online calculators, I see that the 90th percentile corresponds to 1.2815σ and the 95th to 1.645σ (http://vassarstats.net/tabs.html?#z)
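
Those factors can also be checked without an online calculator. A minimal sketch using the standard library's `statistics.NormalDist` (Python 3.8+):

```python
from statistics import NormalDist

# Inverse CDF of the standard normal gives the number of sigmas
# between the median and a given percentile.
z90 = NormalDist().inv_cdf(0.90)
z95 = NormalDist().inv_cdf(0.95)
print(round(z90, 4), round(z95, 4))  # 1.2816 1.6449
```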

As a simple example:

import numpy as np

# Create some random numbers
x = np.random.normal(5, 2, 1000)

# Calculate the statistics
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))

# Add a few large points
x[10] += 1000
x[20] += 2000
x[30] += 1500

# Recalculate the statistics
print()
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))

# Measure the percentile intervals and then estimate Standard Deviation of the distribution, both from median to the 90th percentile and from the 10th to 90th percentile
p90 = np.percentile(x, 90)
p10 = np.percentile(x, 10)
p50 = np.median(x)
# p50 to p90 is 1.2815 sigma
rSig = (p90-p50)/1.2815
print("Robust Sigma=", rSig)

rSig = (p90-p10)/(2*1.2815)
print("Robust Sigma=", rSig)

The output I get is:

Mean=  4.99760520022
Median=  4.95395274981
Max/Min= 11.1226494654   -2.15388472011
StdDev= 1.976629928
90th Percentile 7.52065379649

Mean=  9.64760520022
Median=  4.95667658782
Max/Min= 2205.43861943   -2.15388472011
StdDev= 88.6263902244
90th Percentile 7.60646688694

Robust Sigma= 2.06772555531
Robust Sigma= 1.99878292462

Which is close to the expected value of 2.

If we want to remove points more than 5 robust standard deviations from the median (for comparison, with 1000 Gaussian points we would expect only about 3 values beyond 3 standard deviations):

y = x[abs(x - p50) < rSig*5]

# Print the statistics again
print("Mean= ", np.mean(y))
print("Median= ", np.median(y))
print("Max/Min=", y.max(), " ", y.min())
print("StdDev=", np.std(y))

Which gives:

Mean=  4.99755359935
Median=  4.95213030447
Max/Min= 11.1226494654   -2.15388472011
StdDev= 1.97692712883
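
The steps above could be wrapped into reusable functions along these lines. This is only a sketch: `robust_sigma` and `reject_outliers_robust` are names invented here, the 10th-to-90th percentile estimate is the one used in the example, and the Gaussian assumption still applies.

```python
import numpy as np

def robust_sigma(x):
    """Estimate sigma from the 10th-90th percentile span (Gaussian assumption)."""
    p10, p90 = np.percentile(x, [10, 90])
    # p10 to p90 spans 2 * 1.2815 sigma for a normal distribution
    return (p90 - p10) / (2 * 1.2815)

def reject_outliers_robust(x, n_sigma=5.0):
    """Keep points within n_sigma robust sigmas of the median."""
    x = np.asarray(x, dtype=float)
    return x[np.abs(x - np.median(x)) < n_sigma * robust_sigma(x)]

# Same setup as the example: Gaussian data plus three large points
rng = np.random.default_rng(42)
x = rng.normal(5, 2, 1000)
x[[10, 20, 30]] += [1000, 2000, 1500]

y = reject_outliers_robust(x)
print("Robust Sigma=", robust_sigma(x))
print("Kept", len(y), "of", len(x), "points; max =", y.max())
```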

I have no idea which approach is the more efficient/robust.
