I would like to filter a numpy array (or pandas DataFrame) so that only continuous series of the same value with at least window_size length are kept and everything else is set to 0.
For example:
[1,1,1,0,0,1,1,1,1,0,0,1,0,0,0,1,1,1,0,1,1,1,1]
should become the following when using a window size of 4:
[0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1,1]
I've tried using rolling_apply and scipy.ndimage.filters.generic_filter, but due to the nature of rolling kernel functions I don't think this is the right approach here (and I am stuck with it at the moment).
I'll include my attempt here anyway:
import numpy as np
import pandas as pd
from scipy import ndimage

df = pd.DataFrame({'x': np.array([1,1,1,0,0,1,1,1,1,0,0,1,0,0,0,1,1,1,0,1,1,1,1])})
df_alt = df.copy()

def filter_df(df, colname, window_size):
    rolling_func = lambda z: z.sum() >= window_size
    # modern replacement for the deprecated pd.rolling_apply;
    # note the integer division for min_periods
    df[colname] = (df[colname]
                   .rolling(window_size, min_periods=window_size // 2, center=True)
                   .apply(rolling_func))

def filter_alt(df, colname, window_size):
    rolling_func = lambda z: z.sum() >= window_size
    return ndimage.generic_filter(df[colname].values,
                                  rolling_func,
                                  size=window_size,
                                  origin=0)

window_size = 4
filter_df(df, 'x', window_size)
print(df)
filter_alt(df_alt, 'x', window_size)
That is basically a morphological opening operation from image-processing, applied to the 1D case: an erosion removes runs shorter than the window, and a dilation restores the runs that survive. Such operations can be implemented with convolution, and NumPy does support 1D convolution, so we are in luck! Thus, to solve our case, it would be something like this -
def conv_app(A, WSZ):
    K = np.ones(WSZ, dtype=int)
    L = WSZ - 1
    # The first convolution counts ones per window; >= WSZ marks windows that
    # are all ones (erosion). The second convolution spreads those marks back
    # over the window extent (dilation); [L:-L] trims the 'full' mode padding.
    return (np.convolve(np.convolve(A, K) >= WSZ, K)[L:-L] > 0).astype(int)
Sample run -
In [581]: A
Out[581]: array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1])
In [582]: conv_app(A,4)
Out[582]: array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
In [583]: A = np.append(1,A) # Prepend a 1 and see what happens!
In [584]: A
Out[584]: array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1])
In [585]: conv_app(A,4)
Out[585]: array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
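Incidentally, since this is a 1D morphological opening, scipy.ndimage can also perform it directly. A minimal sketch, assuming a binary 0/1 input array (not included in the timings below):

import numpy as np
from scipy import ndimage

def opening_app(A, WSZ):
    # Opening with a flat structuring element of length WSZ keeps only
    # runs of ones that are at least WSZ long
    return ndimage.binary_opening(A, structure=np.ones(WSZ)).astype(int)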
Runtime tests -
This section benchmarks a couple of the other approaches posted to solve this question. Their definitions are listed below -
import itertools
import pandas as pd

def groupby_app(A, WSZ): # @lambo477's solution
    groups = itertools.groupby(A)
    result = []
    for group in groups:
        group_items = [item for item in group[1]]
        group_length = len(group_items)
        if group_length >= WSZ:
            result.extend([item for item in group_items])
        else:
            result.extend([0]*group_length)
    return result
def stride_tricks_app(arr, window): # @ajcr's solution
    # modern replacement for the deprecated pd.rolling_min
    x = pd.Series(arr).rolling(window).min().to_numpy()
    x[:window-1] = 0
    # the (8, 8) strides assume 8-byte float64 elements
    y = np.lib.stride_tricks.as_strided(x, (len(x)-window+1, window), (8, 8))
    y[y[:, -1] == 1] = 1
    return x.astype(int)
Timings -
In [541]: A = np.random.randint(0,2,(100000))
In [542]: WSZ = 4
In [543]: %timeit groupby_app(A,WSZ)
10 loops, best of 3: 74.5 ms per loop
In [544]: %timeit stride_tricks_app(A,WSZ)
100 loops, best of 3: 3.35 ms per loop
In [545]: %timeit conv_app(A,WSZ)
100 loops, best of 3: 2.82 ms per loop
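Outside IPython the %timeit magic isn't available; a rough stand-alone harness using the standard timeit module might look like this (a sketch; absolute numbers will vary with the machine and library versions):

import timeit
import numpy as np

A = np.random.randint(0, 2, 100000)
WSZ = 4

for func in (groupby_app, stride_tricks_app, conv_app):
    # time 10 calls and report the mean per-call cost in milliseconds
    t = timeit.timeit(lambda: func(A, WSZ), number=10)
    print('%s: %.2f ms per loop' % (func.__name__, t / 10 * 1e3))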
You could use itertools.groupby as follows:
import itertools
import numpy as np

my_array = np.array([1,1,1,0,0,1,1,1,1,0,0,1,0,0,0,1,1,1,0,1,1,1,1])
window_size = 4

groups = itertools.groupby(my_array)
result = []
for group in groups:
    group_items = [item for item in group[1]]
    group_length = len(group_items)
    if group_length >= window_size:
        result.extend([item for item in group_items])
    else:
        result.extend([0]*group_length)
print(result)
Output
[0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
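For what it's worth, the same idea condenses into a reusable function. A minimal sketch (the name groupby_filter is just illustrative):

import itertools
import numpy as np

def groupby_filter(a, window_size):
    out = []
    for value, group in itertools.groupby(a):
        run = list(group)  # materialize the run so we can measure its length
        out.extend(run if len(run) >= window_size else [0] * len(run))
    return np.array(out)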