Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

numpy 1D array: mask elements that repeat more than n times

Q: given an array of integers like

[1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]

I need to mask elements that repeat more than N times. The goal is to retrieve the boolean mask array.

I came up with a rather complicated solution:

import numpy as np

bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])

N = 3
splits = np.split(bins, np.where(np.diff(bins) != 0)[0]+1)
mask = []
for s in splits:
    if s.shape[0] <= N:
        mask.append(np.ones(s.shape[0]).astype(np.bool_))
    else:
        mask.append(np.append(np.ones(N), np.zeros(s.shape[0]-N)).astype(np.bool_)) 

mask = np.concatenate(mask)

giving e.g.

bins[mask]
Out[90]: array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5])

Is there a nicer way to do this?


Wrap-up: Here's a slim version of MSeifert's benchmark plot (thanks for pointing me to simple_benchmark). Showing the four most performant options: enter image description here

The idea proposed by Florian H, modified by Paul Panzer seems to be a great way of solving this problem as it is pretty straight forward and numpy-only. If you're fine with using numba, MSeifert's solution outperforms the other.

I chose to accept MSeifert's answer as solution as it is the more general answer: It correctly handles arbitrary arrays with (non-unique) blocks of consecutive repeating elements. In case numba is a no-go, Divakar's answer is also worth a look.

like image 984
FObersteiner Avatar asked Oct 21 '19 07:10

FObersteiner


People also ask

How do you repeat an element in a NumPy array?

NumPy: repeat() function The repeat() function is used to repeat elements of an array. Input array. The number of repetitions for each element. repeats is broadcasted to fit the shape of the given axis.

What is masked array in NumPy?

A masked array is the combination of a standard numpy. ndarray and a mask. A mask is either nomask , indicating that no value of the associated array is invalid, or an array of booleans that determines for each element of the associated array whether the value is valid or not.

What is fancy indexing in Python?

Fancy indexing is conceptually simple: it means passing an array of indices to access multiple array elements at once. For example, consider the following array: import numpy as np rand = np. random. RandomState(42) x = rand.

What is NP tile ()?

The numpy.tile() function constructs a new array by repeating array – 'arr', the number of times we want to repeat as per repetitions. The resulted array will have dimensions max(arr.ndim, repetitions) where, repetitions is the length of repetitions.


1 Answers

Disclaimer: this is just a sounder implementation of @FlorianH's idea:

def f(a,N):
    mask = np.empty(a.size,bool)
    mask[:N] = True
    np.not_equal(a[N:],a[:-N],out=mask[N:])
    return mask

For larger arrays this makes a huge difference:

a = np.arange(1000).repeat(np.random.randint(0,10,1000))
N = 3

print(timeit(lambda:f(a,N),number=1000)*1000,"us")
# 5.443050000394578 us

# compare to
print(timeit(lambda:[True for _ in range(N)] + list(bins[:-N] != bins[N:]),number=1000)*1000,"us")
# 76.18969900067896 us
like image 80
Paul Panzer Avatar answered Sep 28 '22 17:09

Paul Panzer