Check if two numpy arrays are identical

Tags:

python

numpy

Suppose I have a bunch of arrays, including x and y, and I want to check if they're equal. Generally, I can just use np.all(x == y) (barring some dumb corner cases which I'm ignoring now).

However, this evaluates the entire array (x == y), which is usually not needed. My arrays are really large, I have a lot of them, and the probability of two arrays being equal is small. In all likelihood, only a very small portion of (x == y) needs to be evaluated before the all function could return False, so this is not an optimal solution for me.

I've tried using the builtin all function, in combination with itertools.izip: all(val1==val2 for val1,val2 in itertools.izip(x, y))

However, that is so much slower when the two arrays are equal that, overall, it's still not worth using over np.all. I presume that's because the builtin all is general-purpose. And np.all doesn't work on generators.

Is there a way to do what I want in a more speedy manner?

I know this question is similar to previously asked questions (e.g. Comparing two numpy arrays for equality, element-wise) but they specifically don't cover the case of early termination.

asked May 15 '17 07:05

acdr

4 Answers

Until this is implemented in numpy natively you can write your own function and jit-compile it with numba:

import numpy as np
import numba as nb


@nb.jit(nopython=True)
def arrays_equal(a, b):
    if a.shape != b.shape:
        return False
    for ai, bi in zip(a.flat, b.flat):
        if ai != bi:
            return False
    return True


a = np.random.rand(10, 20, 30)
b = np.random.rand(10, 20, 30)


%timeit np.all(a==b)  # 100000 loops, best of 3: 9.82 µs per loop
%timeit arrays_equal(a, a)  # 100000 loops, best of 3: 9.89 µs per loop
%timeit arrays_equal(a, b)  # 100000 loops, best of 3: 691 ns per loop

Worst-case performance (equal arrays) is equivalent to np.all, and when the comparison can stop early the compiled function can be much faster than np.all (691 ns versus 9.82 µs in the timings above).

answered Oct 19 '22 22:10

MB-F

Adding short-circuit logic to array comparisons is apparently being discussed on the numpy page on github, and will thus presumably be available in a future version of numpy.

answered Oct 19 '22 22:10

acdr

Probably someone who understands the underlying data structure could optimize this or explain whether it's reliable/safe/good practice, but it seems to work.

np.all(a==b)
Out[]: True

memoryview(a.data)==memoryview(b.data)
Out[]: True

%timeit np.all(a==b)
The slowest run took 10.82 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 6.2 µs per loop

%timeit memoryview(a.data)==memoryview(b.data)
The slowest run took 8.55 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.85 µs per loop

If I understand this correctly, ndarray.data creates a pointer to the data buffer and memoryview creates a native python type that can be short-circuited out of the buffer.

I think.

EDIT: Further testing shows the improvement may not be as large as shown above. The earlier timings used a=b=np.eye(5); with larger arrays:

a=np.random.randint(0,10,(100,100))
b=a.copy()

%timeit np.all(a==b)
The slowest run took 6.70 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 17.7 µs per loop

%timeit memoryview(a.data)==memoryview(b.data)
10000 loops, best of 3: 30.1 µs per loop

np.all(a==b)
Out[]: True

memoryview(a.data)==memoryview(b.data)
Out[]: True
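
To make the comparison above a little less mysterious, here is a minimal sketch of what it actually checks, assuming Python 3 and C-contiguous float64 arrays. The variable names are only for illustration. It shows that ndarray.data is already a memoryview, and that memoryview equality depends on the shape as well as the values. It says nothing about speed or about whether the comparison exits early.

```python
import numpy as np

a = np.eye(5)   # float64, C-contiguous
b = a.copy()    # same values, separate buffer

# In Python 3, ndarray.data is itself a memoryview over the array's buffer.
data_is_memoryview = isinstance(a.data, memoryview)

# Equal values and equal shape compare as equal.
equal_copy = memoryview(a.data) == memoryview(b.data)

# A single differing element makes the views unequal.
c = a.copy()
c[0, 0] = 2.0
changed_value = memoryview(a.data) == memoryview(c.data)

# Same bytes but a different shape: memoryview equality also compares shape.
flat = a.reshape(-1)
different_shape = memoryview(a.data) == memoryview(flat.data)

print(data_is_memoryview, equal_copy, changed_value, different_shape)
```

Because the shape is part of the comparison, this is not a raw byte-for-byte check of the two buffers.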

answered Oct 19 '22 21:10

Daniel F

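Hmm, I know this is a poor answer, but there seems to be no easy way to do this. The NumPy developers should fix it. I suggest:

import numpy as np

def compare(a, b):
    # Different shapes can never be equal; this also keeps b[0] from failing.
    if np.shape(a) != np.shape(b):
        return False
    # Check short prefixes first so a mismatch near the start exits early.
    if len(a) > 0 and not np.array_equal(a[0], b[0]):
        return False
    if len(a) > 15 and not np.array_equal(a[:15], b[:15]):
        return False
    if len(a) > 200 and not np.array_equal(a[:200], b[:200]):
        return False
    # No early mismatch found: fall back to the full comparison.
    return np.array_equal(a, b)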
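
The prefix checks in compare can be generalized to a loop over fixed-size blocks, so a mismatch anywhere in the array stops the comparison at the block that contains it. The following is a minimal sketch of that idea using only NumPy. The name arrays_equal_chunked and the chunk_size parameter are hypothetical, and the default block size is a guess that would need tuning on real data.

```python
import numpy as np


def arrays_equal_chunked(a, b, chunk_size=65536):
    """Return True if a and b have the same shape and elements.

    Compares one block at a time and returns at the first block that differs.
    chunk_size is a hypothetical tuning knob, not a NumPy parameter.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    if a.shape != b.shape:
        return False
    # reshape(-1) is a view for contiguous arrays (it copies otherwise).
    flat_a = a.reshape(-1)
    flat_b = b.reshape(-1)
    for start in range(0, flat_a.size, chunk_size):
        stop = start + chunk_size
        if not np.array_equal(flat_a[start:stop], flat_b[start:stop]):
            return False
    return True


x = np.arange(1_000_000, dtype=np.float64)
y = x.copy()
y[3] = -1.0  # differs near the start, so only the first block is compared

same = arrays_equal_chunked(x, x.copy())
different = arrays_equal_chunked(x, y)
print(same, different)
```

When the arrays are equal, every block is still compared, so the worst case costs about the same as np.array_equal plus a small per-block overhead. The gain only shows up when arrays tend to differ early.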

answered Oct 19 '22 21:10

Śmigło
