
How to find outliers in a series, vectorized?

Tags: python, pandas, vectorization, numpy

I have a pandas.Series of positive numbers. I need to find the indexes of "outliers", whose values depart by 3 or more from the previous "norm".

How can I vectorize this function?

def baseline(s):
    values = []
    indexes = []
    last_valid = s.iloc[0]
    for idx, val in s.items():  # iteritems() was removed in pandas 2.0
        if abs(val - last_valid) >= 3:
            values.append(val)
            indexes.append(idx)
        else:
            last_valid = val
    return pd.Series(values, index=indexes)

For example, if the input is:

import pandas as pd
s = pd.Series([7,8,9,10,14,10,10,14,100,14,10])
print(baseline(s))

the desired output is:

4     14
7     14
8    100
9     14

Note that the 10 values after the 14s are not returned because they are "back to normal" values.

Edits:

Added abs() to the code. The numbers are positive.
The purpose here is to speed up the code.
An answer that doesn't exactly imitate the code may be acceptable.
Changed the example to include another edge case, where the values slowly change by 3.


asked Dec 12 '13 09:12

Yariv

1 Answer

Here's my original "vectorized" solution:

You can get the last_valid using shift and numpy's where (with numpy imported as np):

In [1]: s = pd.Series([7, 8, 9, 10, 14, 10, 10, 14, 100, 14, 10])

In [2]: last_valid = pd.Series(np.where((s - s.shift()).abs() < 3, s, np.nan))
        last_valid.iloc[0] = s.iloc[0]  # initialize with first value of s
        last_valid.ffill(inplace=True)

In [3]: last_valid
Out[3]:
0      7
1      8
2      9
3     10
4     10
5     10
6     10
7     10
8     10
9     10
10    10
dtype: float64

This makes the problem much easier. You can compare this to s:

In [4]: s - last_valid  # alternatively use (s - last_valid).abs()
Out[4]: 
0      0
1      0
2      0
3      0
4      4
5      0
6      0
7      4
8     90
9      4
10     0
dtype: float64

Those elements which differ from last_valid by 3 or more:

In [5]: (s - last_valid).abs() >= 3
Out[5]: 
0     False
1     False
2     False
3     False
4      True
5     False
6     False
7      True
8      True
9      True
10    False
dtype: bool

In [6]: s[(s - last_valid).abs() >= 3]
Out[6]: 
4     14
7     14
8    100
9     14
dtype: int64

As desired... or so it would seem: @alko's example shows this isn't quite correct.

Update

As pointed out by @alko, the vectorized approach above isn't quite correct. Specifically, for the example s = pd.Series([10, 14, 11, 10, 10, 12, 14, 100, 100, 14, 10]), my "vectorized" approach classifies the second 100 as "not an outlier" even though baseline returns it.
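The discrepancy can be reproduced with a short script. This is only a sketch: shift_attempt is a hypothetical name that wraps the shift/ffill steps from the session above, and baseline is the question's loop with items() in place of iteritems().

```python
import numpy as np
import pandas as pd


def baseline(s):
    """Reference loop from the question (Python 3 / current pandas)."""
    values, indexes = [], []
    last_valid = s.iloc[0]
    for idx, val in s.items():
        if abs(val - last_valid) >= 3:
            values.append(val)
            indexes.append(idx)
        else:
            last_valid = val
    return pd.Series(values, index=indexes)


def shift_attempt(s):
    """The shift/ffill attempt, wrapped in a function (hypothetical name)."""
    close_to_previous = (s - s.shift()).abs() < 3
    last_valid = pd.Series(np.where(close_to_previous, s, np.nan), index=s.index)
    last_valid.iloc[0] = s.iloc[0]  # initialize with the first value of s
    last_valid = last_valid.ffill()
    return s[(s - last_valid).abs() >= 3]


example = pd.Series([7, 8, 9, 10, 14, 10, 10, 14, 100, 14, 10])
counterexample = pd.Series([10, 14, 11, 10, 10, 12, 14, 100, 100, 14, 10])

# Both functions agree on the question's example: indexes 4, 7, 8, 9.
print(list(baseline(example).index), list(shift_attempt(example).index))

# They disagree on @alko's example:
print(list(baseline(counterexample).index))       # [1, 7, 8, 10]
print(list(shift_attempt(counterexample).index))  # [1, 7, 9, 10]
```

On the counterexample the attempt treats the second 100 as a new "norm" because it is close to its immediate predecessor, so it misses index 8 and flags index 9 instead.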

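This leads me (along with @alko) to think this can't be vectorized: whether a value updates last_valid depends on how the earlier values were classified, which makes the scan sequential. As an alternative I've included a simple Cython implementation (see the Cython section of the pandas docs) which is significantly faster than the pure Python version: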

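%%cython
cimport numpy as np
import numpy as np
cimport cython

@cython.wraparound(False)
@cython.boundscheck(False)
cpdef _outliers(np.ndarray[double] s):
    cdef np.ndarray[Py_ssize_t] indexes
    cdef np.ndarray[double] vals
    cdef double last, val
    cdef Py_ssize_t count
    # Preallocate for the worst case (every element is an outlier).
    # np.intp is the NumPy dtype that matches Py_ssize_t on every platform.
    indexes = np.empty(len(s), dtype=np.intp)
    vals = np.empty(len(s))
    last = s[0]  # assumes s is non-empty
    count = 0
    for idx, val in enumerate(s):
        if abs(val - last) >= 3:
            # Outlier: record it and keep the previous "norm".
            indexes[count] = idx
            vals[count] = val
            count += 1
        else:
            # Back to normal: this value becomes the new "norm".
            last = val
    # Trim the preallocated arrays to the number of outliers found.
    return vals[:count], indexes[:count]

# Plain Python wrapper; pandas must be imported as pd where this is defined.
# Note: the resulting index holds positions, not the original labels of s.
def outliers(s):
    return pd.Series(*_outliers(s.values.astype('float')))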

Some indication of timings:

In [11]: s = pd.Series([10,10,12,14,100,100,14,10])

In [12]: %timeit baseline(s)
10000 loops, best of 3: 132 µs per loop

In [13]: %timeit outliers(s)
10000 loops, best of 3: 46.8 µs per loop

In [21]: s = pd.Series(np.random.randint(0, 100, 100000))

In [22]: %timeit baseline(s)
10 loops, best of 3: 161 ms per loop

In [23]: %timeit outliers(s)
100 loops, best of 3: 9.43 ms per loop

For more, see the cython (enhancing performance) section of the pandas docs.
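If compiling Cython is not an option, the same sequential scan can be written over a NumPy array in plain Python. The sketch below is not part of the original answer and has not been benchmarked here; outliers_labeled and its threshold parameter are hypothetical names. Unlike the wrapper above, it keeps the original index labels and dtype of s, because it selects from s with a boolean mask.

```python
import numpy as np
import pandas as pd


def outliers_labeled(s, threshold=3):
    """Sequential scan over the raw values; returns s restricted to outliers."""
    values = s.to_numpy(dtype=float)
    mask = np.zeros(len(values), dtype=bool)
    if len(values) == 0:
        return s[mask]  # nothing to scan
    last = values[0]
    # Iterating over a plain list avoids per-element pandas overhead.
    for i, val in enumerate(values.tolist()):
        if abs(val - last) >= threshold:
            mask[i] = True  # outlier: keep the previous "norm"
        else:
            last = val      # back to normal: update the "norm"
    return s[mask]


labeled = pd.Series([7, 8, 9, 10, 14, 10, 10, 14, 100, 14, 10],
                    index=list("abcdefghijk"))
print(outliers_labeled(labeled))  # rows e, h, i, j with values 14, 14, 100, 14
```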

answered Sep 20 '22 14:09

Andy Hayden
