Q: given an array of integers like
[1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
I need to mask elements that repeat more than N
times. The goal is to retrieve the boolean mask array.
I came up with a rather complicated solution:
import numpy as np
bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])
N = 3
splits = np.split(bins, np.where(np.diff(bins) != 0)[0]+1)
mask = []
for s in splits:
if s.shape[0] <= N:
mask.append(np.ones(s.shape[0]).astype(np.bool_))
else:
mask.append(np.append(np.ones(N), np.zeros(s.shape[0]-N)).astype(np.bool_))
mask = np.concatenate(mask)
giving e.g.
bins[mask]
Out[90]: array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5])
Is there a nicer way to do this?
Wrap-up: Here's a slim version of MSeifert's benchmark plot (thanks for pointing me to simple_benchmark
). Showing the four most performant options:
The idea proposed by Florian H, modified by Paul Panzer seems to be a great way of solving this problem as it is pretty straight forward and numpy
-only. If you're fine with using numba
, MSeifert's solution outperforms the other.
I chose to accept MSeifert's answer as solution as it is the more general answer: It correctly handles arbitrary arrays with (non-unique) blocks of consecutive repeating elements. In case numba
is a no-go, Divakar's answer is also worth a look.
NumPy: repeat() function The repeat() function is used to repeat elements of an array. Input array. The number of repetitions for each element. repeats is broadcasted to fit the shape of the given axis.
A masked array is the combination of a standard numpy. ndarray and a mask. A mask is either nomask , indicating that no value of the associated array is invalid, or an array of booleans that determines for each element of the associated array whether the value is valid or not.
Fancy indexing is conceptually simple: it means passing an array of indices to access multiple array elements at once. For example, consider the following array: import numpy as np rand = np. random. RandomState(42) x = rand.
The numpy.tile() function constructs a new array by repeating array – 'arr', the number of times we want to repeat as per repetitions. The resulted array will have dimensions max(arr.ndim, repetitions) where, repetitions is the length of repetitions.
Disclaimer: this is just a sounder implementation of @FlorianH's idea:
def f(a,N):
mask = np.empty(a.size,bool)
mask[:N] = True
np.not_equal(a[N:],a[:-N],out=mask[N:])
return mask
For larger arrays this makes a huge difference:
a = np.arange(1000).repeat(np.random.randint(0,10,1000))
N = 3
print(timeit(lambda:f(a,N),number=1000)*1000,"us")
# 5.443050000394578 us
# compare to
print(timeit(lambda:[True for _ in range(N)] + list(bins[:-N] != bins[N:]),number=1000)*1000,"us")
# 76.18969900067896 us
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With