I would like to filter a numpy array (or pandas DataFrame) so that only continuous series of the same value with at least window_size length are kept and everything else is set to 0.
For example:
[1,1,1,0,0,1,1,1,1,0,0,1,0,0,0,1,1,1,0,1,1,1,1]
should become the following when using a window size of 4:
[0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1,1]
I've tried using rolling_apply and scipy.ndimage.filters.generic_filter, but due to the nature of rolling kernel functions I don't think this is the right approach here (and I am stuck with it at the moment).
I'll include my attempt here anyway:
import numpy as np
import pandas as pd
from scipy import ndimage

df = pd.DataFrame({'x': np.array([1,1,1,0,0,1,1,1,1,0,0,1,0,0,0,1,1,1,0,1,1,1,1])})
df_alt = df.copy()

def filter_df(df, colname, window_size):
    rolling_func = lambda z: z.sum() >= window_size
    # modern replacement for the deprecated pd.rolling_apply;
    # note the integer division for min_periods
    df[colname] = (df[colname]
                   .rolling(window_size, min_periods=window_size // 2, center=True)
                   .apply(rolling_func))

def filter_alt(df, colname, window_size):
    rolling_func = lambda z: z.sum() >= window_size
    return ndimage.generic_filter(df[colname].values,
                                  rolling_func,
                                  size=window_size,
                                  origin=0)

window_size = 4
filter_df(df, 'x', window_size)
print(df)
filter_alt(df_alt, 'x', window_size)
That is basically a morphological opening operation from image-processing, applied to the 1D case: an erosion removes runs shorter than the window, and a dilation restores the runs that survive. Such operations can be implemented with convolution, and NumPy does support 1D convolution, so we are in luck! Thus, to solve our case, it would be something like this -
def conv_app(A, WSZ):
    K = np.ones(WSZ, dtype=int)
    L = WSZ - 1
    # The first convolution counts ones per window; >= WSZ marks windows that
    # are all ones (erosion). The second convolution spreads those marks back
    # over the window extent (dilation); [L:-L] trims the 'full' mode padding.
    return (np.convolve(np.convolve(A, K) >= WSZ, K)[L:-L] > 0).astype(int)
Sample run -
In [581]: A
Out[581]: array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1])
In [582]: conv_app(A,4)
Out[582]: array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
In [583]: A = np.append(1,A) # Prepend a 1 and see what happens!
In [584]: A
Out[584]: array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1])
In [585]: conv_app(A,4)
Out[585]: array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
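Incidentally, since this is a 1D morphological opening, scipy.ndimage can also perform it directly. A minimal sketch, assuming a binary 0/1 input array (not included in the timings below):

import numpy as np
from scipy import ndimage

def opening_app(A, WSZ):
    # Opening with a flat structuring element of length WSZ keeps only
    # runs of ones that are at least WSZ long
    return ndimage.binary_opening(A, structure=np.ones(WSZ)).astype(int)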
Runtime tests -
This section benchmarks a couple of the other approaches posted to solve this question. Their definitions are listed below -
import itertools
import pandas as pd

def groupby_app(A, WSZ): # @lambo477's solution
    groups = itertools.groupby(A)
    result = []
    for group in groups:
        group_items = [item for item in group[1]]
        group_length = len(group_items)
        if group_length >= WSZ:
            result.extend([item for item in group_items])
        else:
            result.extend([0]*group_length)
    return result
def stride_tricks_app(arr, window): # @ajcr's solution
    # modern replacement for the deprecated pd.rolling_min
    x = pd.Series(arr).rolling(window).min().to_numpy()
    x[:window-1] = 0
    # the (8, 8) strides assume 8-byte float64 elements
    y = np.lib.stride_tricks.as_strided(x, (len(x)-window+1, window), (8, 8))
    y[y[:, -1] == 1] = 1
    return x.astype(int)
Timings -
In [541]: A = np.random.randint(0,2,(100000))
In [542]: WSZ = 4
In [543]: %timeit groupby_app(A,WSZ)
10 loops, best of 3: 74.5 ms per loop
In [544]: %timeit stride_tricks_app(A,WSZ)
100 loops, best of 3: 3.35 ms per loop
In [545]: %timeit conv_app(A,WSZ)
100 loops, best of 3: 2.82 ms per loop
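Outside IPython the %timeit magic isn't available; a rough stand-alone harness using the standard timeit module might look like this (a sketch; absolute numbers will vary with the machine and library versions):

import timeit
import numpy as np

A = np.random.randint(0, 2, 100000)
WSZ = 4

for func in (groupby_app, stride_tricks_app, conv_app):
    # time 10 calls and report the mean per-call cost in milliseconds
    t = timeit.timeit(lambda: func(A, WSZ), number=10)
    print('%s: %.2f ms per loop' % (func.__name__, t / 10 * 1e3))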
You could use itertools.groupby as follows:
import itertools
import numpy as np

my_array = np.array([1,1,1,0,0,1,1,1,1,0,0,1,0,0,0,1,1,1,0,1,1,1,1])
window_size = 4

groups = itertools.groupby(my_array)
result = []
for group in groups:
    group_items = [item for item in group[1]]
    group_length = len(group_items)
    if group_length >= window_size:
        result.extend([item for item in group_items])
    else:
        result.extend([0]*group_length)
print(result)
Output
[0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
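For what it's worth, the same idea condenses into a reusable function. A minimal sketch (the name groupby_filter is just illustrative):

import itertools
import numpy as np

def groupby_filter(a, window_size):
    out = []
    for value, group in itertools.groupby(a):
        run = list(group)  # materialize the run so we can measure its length
        out.extend(run if len(run) >= window_size else [0] * len(run))
    return np.array(out)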