Filtering Outliers - how to make median-based Hampel Function faster?

Tags:

I need to use a Hampel filter on my data, stripping outliers.

I haven't been able to find an existing one in Python; only in Matlab and R.

[Matlab function description][1]

[Stats Exchange discussion of Matlab Hampel function][2]

[R pracma package vignette; contains hampel function][3]

I've written the following function, modeling it off the function in the R pracma package; however, it is far far slower than the Matlab version. This is not ideal; would appreciate input on how to speed it up.

The function is shown below-

def hampel(x,k, t0=3):
    '''adapted from hampel function in R package pracma
    x= 1-d numpy array of numbers to be filtered
    k= number of items in window/2 (# forward and backward wanted to capture in median filter)
    t0= number of standard deviations to use; 3 is default
    '''
    n = len(x)
    y = x #y is the corrected series
    L = 1.4826
    for i in range((k + 1),(n - k)):
        if np.isnan(x[(i - k):(i + k+1)]).all():
            continue
        x0 = np.nanmedian(x[(i - k):(i + k+1)])
        S0 = L * np.nanmedian(np.abs(x[(i - k):(i + k+1)] - x0))
        if (np.abs(x[i] - x0) > t0 * S0):
            y[i] = x0
    return(y)

The R implementation in "pracma" package, which I am using as a model:

function (x, k, t0 = 3) 
{
    n <- length(x)
    y <- x
    ind <- c()
    L <- 1.4826
    for (i in (k + 1):(n - k)) {
        x0 <- median(x[(i - k):(i + k)])
        S0 <- L * median(abs(x[(i - k):(i + k)] - x0))
        if (abs(x[i] - x0) > t0 * S0) {
            y[i] <- x0
            ind <- c(ind, i)
        }
    }
    list(y = y, ind = ind)
}

Any help in making function more efficient, or a pointer to an existing implementation in an existing Python module would be much appreciated. Example data below; %%timeit cell magic in Jupyter indicates it currently takes 15 seconds to run:

vals=np.random.randn(250000)
vals[3000]=100
vals[200]=-9000
vals[-300]=8922273
%%timeit
hampel(vals, k=6)

[1]: https://www.mathworks.com/help/signal/ref/hampel.html [2]: https://dsp.stackexchange.com/questions/26552/what-is-a-hampel-filter-and-how-does-it-work [3]: https://cran.r-project.org/web/packages/pracma/pracma.pdf

370

asked Oct 18 '17 21:10

EHB

1 Answers

Solution by @EHB above is helpful, but it is incorrect. Specifically, the rolling median calculated in median_abs_deviation is of difference, which itself is the difference between each data point and the rolling median calculated in rolling_median, but it should be the median of differences between the data in the rolling window and the median over the window. I took the code above and modified it:

def hampel(vals_orig, k=7, t0=3):
    '''
    vals: pandas series of values from which to remove outliers
    k: size of window (including the sample; 7 is equal to 3 on either side of value)
    '''
    
    #Make copy so original not edited
    vals = vals_orig.copy()
    
    #Hampel Filter
    L = 1.4826
    rolling_median = vals.rolling(window=k, center=True).median()
    MAD = lambda x: np.median(np.abs(x - np.median(x)))
    rolling_MAD = vals.rolling(window=k, center=True).apply(MAD)
    threshold = t0 * L * rolling_MAD
    difference = np.abs(vals - rolling_median)
    
    '''
    Perhaps a condition should be added here in the case that the threshold value
    is 0.0; maybe do not mark as outlier. MAD may be 0.0 without the original values
    being equal. See differences between MAD vs SDV.
    '''
    
    outlier_idx = difference > threshold
    vals[outlier_idx] = rolling_median[outlier_idx] 
    return(vals)

175

answered Sep 19 '22 15:09

Eduardo Osorio

Related questions
                            
                                How to create unittests for python prompt toolkit?
                            
                                Why is ctypes so slow to convert a Python list to a C array?
                            
                                How to access directory file outside django project?
                            
                                Create a virtualenv from another virtualenv
                            
                                All possibilities to split a list into two lists
                            
                                Python typehints and linters
                            
                                How to resolve "ImportError: DLL load failed:" on Python?
                            
                                Getting "502 Bad Gateway" with nginx, uwsgi python-flask on ubuntu 16.04
                            
                                Why does PyYAML use generators to construct objects?
                            
                                Filter pandas data frame with results of groupby
                            
                                gspread "SpreadsheetNotFound" on certain workbooks
                            
                                subprocess python filenotfounderror: [winerror 2]
                            
                                Pandas index title in line with column headers
                            
                                How to save estimator in Tensorflow for later use?
                            
                                Keep only alphabetic characters (multilingual) in a string
                            
                                Python loading old version of sklearn
                            
                                PyPDF2 write doesn't work on some PDF files (Python 3.5.1)
                            
                                Javascript - parse string to long
                            
                                convert PIL Image object to File object
                            
                                Python - Manipulate and read browser from current browser

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Filtering Outliers - how to make median-based Hampel Function faster?

Tags:

python

function

pandas

filter

EHB

People also ask

1 Answers

Eduardo Osorio

Recent Activity

Donate For Us