Filtering pandas or numpy arrays for continuous series with minimum window length

I would like to filter a numpy array (or pandas DataFrame) so that only continuous runs of the same value that are at least window_size long are kept, and everything else is set to 0.

For example:

[1,1,1,0,0,1,1,1,1,0,0,1,0,0,0,1,1,1,0,1,1,1,1]

should become the following when using a window size of 4:

[0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1,1]

I've tried using rolling_apply and scipy.ndimage.filters.generic_filter, but due to the nature of rolling kernel functions I don't think either is the right approach here (and I am stuck with it at the moment).

I insert my attempt here anyway:

import numpy as np
import pandas as pd
from scipy import ndimage

df = pd.DataFrame({'x': np.array([1,1,1,0,0,1,1,1,1,0,0,1,0,0,0,1,1,1,0,1,1,1,1])})
df_alt = df.copy()

def filter_df(df, colname, window_size):
    # Flag windows whose sum reaches window_size, i.e. windows of all 1s
    rolling_func = lambda z: z.sum() >= window_size
    df[colname] = pd.rolling_apply(df[colname],
                                   window_size,
                                   rolling_func,
                                   min_periods=window_size // 2,
                                   center=True)

def filter_alt(df, colname, window_size):
    # Same window predicate, applied via a generic image filter
    rolling_func = lambda z: z.sum() >= window_size
    return ndimage.generic_filter(df[colname].values,
                                  rolling_func,
                                  size=window_size,
                                  origin=0)

window_size = 4
filter_df(df, 'x', window_size)
print(df)
print(filter_alt(df_alt, 'x', window_size))
asked Jan 05 '16 by pho




2 Answers

That is basically a morphological opening operation from image processing, applied to the 1D case: an erosion removes runs shorter than the window, and a dilation then restores the surviving runs to their full length. Such operations can be implemented with convolution, and NumPy does support 1D convolution, so we are in luck! Thus, to solve our case, it would be something like this -

def conv_app(A, WSZ):
    K = np.ones(WSZ, dtype=int)
    L = WSZ-1
    # Inner convolve + threshold marks positions covered by a full window of 1s (erosion);
    # outer convolve + trim + threshold expands those marks back to full runs (dilation)
    return (np.convolve(np.convolve(A,K)>=WSZ,K)[L:-L]>0).astype(int)

Sample run -

In [581]: A
Out[581]: array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1])

In [582]: conv_app(A,4)
Out[582]: array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

In [583]: A = np.append(1,A) # Append 1 and see what happens!

In [584]: A
Out[584]: array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1])

In [585]: conv_app(A,4)
Out[585]: array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
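
For reference, SciPy ships this morphological operation directly. Here is a minimal sketch of the same filter built on scipy.ndimage.binary_opening (my addition for illustration, not part of the original answer; it assumes a binary 0/1 input array):

import numpy as np
from scipy import ndimage

def opening_app(A, WSZ):
    # A flat structuring element of length WSZ keeps only runs of 1s
    # that are at least WSZ long; everything shorter is erased
    return ndimage.binary_opening(A, structure=np.ones(WSZ)).astype(int)

A = np.array([1,1,1,0,0,1,1,1,1,0,0,1,0,0,0,1,1,1,0,1,1,1,1])
print(opening_app(A, 4))
# [0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1]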

Runtime tests -

This section benchmarks a couple of the other approaches posted for this question. Their definitions are listed below -

def groupby_app(A,WSZ): # @gtlambert's solution (see the second answer below)
    groups = itertools.groupby(A)
    result = []
    for group in groups:
        group_items = [item for item in group[1]]
        group_length = len(group_items)
        if group_length >= WSZ:
            result.extend([item for item in group_items])
        else:
            result.extend([0]*group_length)
    return result

def stride_tricks_app(arr, window): # @ajcr's solution
    x = pd.rolling_min(arr, window)
    x[:window-1] = 0
    # Overlapping-window view of x; the (8, 8) strides assume 8-byte float64 elements
    y = np.lib.stride_tricks.as_strided(x, (len(x)-window+1, window), (8, 8))
    y[y[:, -1] == 1] = 1
    return x.astype(int)

Timings -

In [541]: A = np.random.randint(0,2,(100000))

In [542]: WSZ = 4

In [543]: %timeit groupby_app(A,WSZ)
10 loops, best of 3: 74.5 ms per loop

In [544]: %timeit stride_tricks_app(A,WSZ)
100 loops, best of 3: 3.35 ms per loop

In [545]: %timeit conv_app(A,WSZ)
100 loops, best of 3: 2.82 ms per loop
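
As a quick sanity check (my addition, assuming the definitions above are in scope and itertools is imported), the convolution and groupby approaches can be confirmed to agree on random binary input:

import numpy as np

A = np.random.randint(0, 2, 1000)
assert np.array_equal(conv_app(A, 4), np.asarray(groupby_app(A, 4)))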
answered Oct 25 '22 by Divakar


You could use itertools.groupby as follows:

import itertools
import numpy as np

my_array = np.array([1,1,1,0,0,1,1,1,1,0,0,1,0,0,0,1,1,1,0,1,1,1,1])
window_size = 4

groups = itertools.groupby(my_array)

result = []
for value, items in groups:
    group_items = list(items)
    group_length = len(group_items)
    if group_length >= window_size:
        # Run is long enough: keep its values as-is
        result.extend(group_items)
    else:
        # Run is too short: replace it with zeros
        result.extend([0] * group_length)

print(result)

Output

[0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
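
If the data lives in a pandas Series, the same run-length idea can be expressed without an explicit Python loop. A sketch I am adding for illustration (not part of the original answer), using a cumulative-sum trick to label each run:

import pandas as pd

def filter_runs(s, window_size):
    # Each change of value starts a new run; cumsum gives each run a unique id
    run_id = (s != s.shift()).cumsum()
    # Keep a value only if its whole run is at least window_size long
    long_enough = s.groupby(run_id).transform('size') >= window_size
    return s.where(long_enough, 0)

s = pd.Series([1,1,1,0,0,1,1,1,1,0,0,1,0,0,0,1,1,1,0,1,1,1,1])
print(filter_runs(s, 4).tolist())
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]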
answered Oct 25 '22 by gtlambert