numpy 1D array: mask elements that repeat more than n times

Tags: python, arrays, numpy, binning

Q: Given an array of integers like

[1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]

I need to mask elements that repeat more than N times. The goal is to retrieve the boolean mask array.

I came up with a rather complicated solution:

import numpy as np

bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])

N = 3
splits = np.split(bins, np.where(np.diff(bins) != 0)[0]+1)
mask = []
for s in splits:
    if s.shape[0] <= N:
        mask.append(np.ones(s.shape[0]).astype(np.bool_))
    else:
        mask.append(np.append(np.ones(N), np.zeros(s.shape[0]-N)).astype(np.bool_)) 

mask = np.concatenate(mask)

giving e.g.

bins[mask]
Out[90]: array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5])

Is there a nicer way to do this?

Wrap-up: Here's a slim version of MSeifert's benchmark plot (thanks for pointing me to simple_benchmark), showing the four most performant options: [image: https://i.stack.imgur.com/kY7fX.png]

The idea proposed by Florian H and modified by Paul Panzer seems to be a great way of solving this problem, as it is pretty straightforward and numpy-only. If you're fine with using numba, MSeifert's solution outperforms the others.

I chose to accept MSeifert's answer as the solution, as it is the more general answer: it correctly handles arbitrary arrays with (non-unique) blocks of consecutive repeating elements. In case numba is a no-go, Divakar's answer is also worth a look.
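To illustrate what "non-unique blocks" means here, the following is a minimal sketch, not MSeifert's actual code: a plain-Python run-length loop (the kind of loop one might hand to numba) compared against the shifted comparison a[N:] != a[:-N]. The names mask_runs, blocks, general and shifted are made up for this example.

```python
import numpy as np

def mask_runs(arr, n):
    """Keep at most n elements from each run of consecutive equal values."""
    mask = np.ones(len(arr), dtype=bool)
    run_length = 0
    for i in range(len(arr)):
        # Extend the current run or start a new one
        if i > 0 and arr[i] == arr[i - 1]:
            run_length += 1
        else:
            run_length = 1
        if run_length > n:
            mask[i] = False
    return mask

# The value 1 appears in two separate blocks, each of length 3
blocks = np.array([1, 1, 1, 2, 1, 1, 1])
N = 3

general = mask_runs(blocks, N)
# Shifted comparison: only reliable when each value forms a single block
shifted = np.concatenate([np.ones(N, dtype=bool), blocks[N:] != blocks[:-N]])

print(general)  # every element is kept, since no run is longer than N
print(shifted)  # drops two elements of the second block of 1s
```

On sorted input such as the bins array from the question, both approaches give the same mask; they only differ once a value reappears in a later block.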

asked Oct 21 '19 07:10

FObersteiner

1 Answer

Disclaimer: this is just a sounder implementation of @FlorianH's idea:

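def f(a, N):
    # Assumes equal values form one contiguous block (e.g. sorted input) and N >= 1
    mask = np.empty(a.size, bool)
    mask[:N] = True  # the first N elements can never exceed the limit
    # a[i] is among the first N of its block iff it differs from a[i - N]
    np.not_equal(a[N:], a[:-N], out=mask[N:])
    return mask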
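As a quick sanity check on the question's data, a minimal sketch could look like this. It repeats f so the snippet runs on its own; the name kept is only for illustration.

```python
import numpy as np

def f(a, N):
    mask = np.empty(a.size, bool)
    mask[:N] = True
    # Compare each element with the one N positions earlier
    np.not_equal(a[N:], a[:-N], out=mask[N:])
    return mask

bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])
kept = bins[f(bins, 3)]
print(kept)
# [1 1 2 2 2 3 3 3 4 4 4 5 5 5]
```

This matches the output shown in the question.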

For larger arrays this makes a huge difference:

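from timeit import timeit

a = np.arange(1000).repeat(np.random.randint(0,10,1000))
N = 3

# total seconds for 1000 runs, times 1000, equals microseconds per call
print(timeit(lambda:f(a,N),number=1000)*1000,"us")
# 5.443050000394578 us

# compare to the list-based version, applied to the same array
print(timeit(lambda:[True for _ in range(N)] + list(a[:-N] != a[N:]),number=1000)*1000,"us")
# 76.18969900067896 us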

answered Sep 28 '22 17:09

Paul Panzer
