Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: Fastest Way to compare arrays elementwise

I am looking for the fastest way to output the index of the first difference of two arrays in Python. For example, let's take the following two arrays:

test1 = [1, 3, 5, 8]
test2 = [1]
test3 = [1, 3]

Comparing test1 and test2, I would like to output 1, while the comparison of test1 and test3 should output 2.

In other words I look for an equivalent to the statement:

import numpy as np
np.where(np.where(test1 == test2, test1, 0) == '0')[0][0] 

with varying array lengths.

Any help is appreciated.

like image 228
Andy Avatar asked May 10 '15 17:05

Andy


People also ask

How do you compare two arrays in Elementwise Python?

To compare two arrays and return the element-wise minimum, use the numpy. fmin() method in Python Numpy. Return value is either True or False. Compare two arrays and returns a new array containing the element-wise maxima.

How do I compare two arrays of different sizes Python?

import numpy as np A = np. array([[1, 1], [2, 2]]) B = np. array([[1, 1], [2, 2]]) print(A == B) In this resulting matrix, each element is a result of a comparison of two corresponding elements in the two arrays.

How do I compare two array sizes?

equals() Method. Java Arrays class provides the equals() method to compare two arrays. It iterates over each value of an array and compares the elements using the equals() method.


3 Answers

For lists this works:

from itertools import zip_longest

def find_first_diff(list1, list2):
    for index, (x, y) in enumerate(zip_longest(list1, list2, 
                                               fillvalue=object())):
        if x != y:
            return index

zip_longest pads the shorter list with None or with a provided fill value. The standard zip does not work if the difference is caused by different list lengths rather than actual different values in the lists.

On Python 2 use izip_longest.

Updated: Created unique fill value to avoid potential problems with None as list value. object() is unique:

>>> o1 = object()
>>> o2 = object()
>>> o1 == o2
False

This pure Python approach might be faster than a NumPy solution. This depends on the actual data and other circumstances.

  1. Converting a list into a NumPy array also takes time. This might actually take longer than finding the index with the function above. If you are not going to use the NumPy array for other calculations, the conversion might cause considerable overhead.

  2. NumPy always searches the full array. If the difference comes early, you do a lot more work than you need to.

  3. NumPy creates a bunch of intermediate arrays. This costs memory and time.

  4. NumPy needs to construct intermediate arrays with the maximum length. Comparing many small with very large arrays is unfavorable here.

In general, in many cases NumPy is faster than a pure Python solution. But each case is a bit different and there are situations where pure Python is faster.

like image 126
Mike Müller Avatar answered Sep 29 '22 16:09

Mike Müller


with numpy arrays (which will be faster for big arrays) then you could check the lengths of the lists then (also) check the overlapping parts something like the following (obviously slicing the longer to the length of the shorter):

import numpy as np

n = min(len(test1), len(test2))
x = np.where(test1[:n] != test2[:n])[0]
if len(x) > 0:
  ans = x[0]
elif len(test1) != len(test2):
  ans = n
else:
  ans = None

EDIT - despite this being voted down I will leave my answer up here in case someone else needs to do something similar.

If the starting arrays are large and numpy then this is the fastest method. Also I had to modify Andy's code to get it to work. In the order: 1. my suggestion, 2. Paidric's (now removed but the most elegant), 3. Andy's accepted answer, 4. zip - non numpy, 5. vanilla python without zip as per @leekaiinthesky

0.1ms, 9.6ms, 0.6ms, 2.8ms, 2.3ms

if the conversion to ndarray is included in timeit then the non-numpy nop-zip method is fastest

7.1ms, 17.1ms, 7.7ms, 2.8ms, 2.3ms

and even more so if the difference between the two lists is at around index 1,000 rather than 10,000

7.1ms, 17.1ms, 7.7ms, 0.3ms, 0.2ms

import timeit

setup = """
import numpy as np
from itertools import zip_longest
list1 = [1 for i in range(10000)] + [4, 5, 7]
list2 = [1 for i in range(10000)] + [4, 4]
test1 = np.array(list1)
test2 = np.array(list2)

def find_first_diff(l1, l2):
    for index, (x, y) in enumerate(zip_longest(l1, l2, fillvalue=object())):
        if x != y:
            return index

def findFirstDifference(list1, list2):
  minLength = min(len(list1), len(list2))
  for index in range(minLength):
    if list1[index] != list2[index]:
      return index
  return minLength
"""

fn = ["""
n = min(len(test1), len(test2))
x = np.where(test1[:n] != test2[:n])[0]
if len(x) > 0:
  ans = x[0]
elif len(test1) != len(test2):
  ans = n
else:
  ans = None""",
"""
x = np.where(np.in1d(list1, list2) == False)[0]
if len(x) > 0:
  ans = x[0]
else:
  ans = None""",
"""
x = test1
y = np.resize(test2, x.shape)
x = np.where(np.where(x == y, x, 0) == 0)[0]
if len(x) > 0:
  ans = x[0]
else:
  ans = None""",
"""
ans = find_first_diff(list1, list2)""",
"""
ans = findFirstDifference(list1, list2)"""]

for f in fn:
  print(timeit.timeit(f, setup, number = 1000))
like image 31
paddyg Avatar answered Sep 29 '22 14:09

paddyg


Here one way to do it:

from itertools import izip
def compare_lists(lista, listb):
    """
    Compare two lists and return the first index where they differ. if
    they are equal, return the list len
    """
    for position, (a, b) in enumerate(zip(lista, listb)):
        if a != b:
            return position
    return min([len(lista), len(listb)])
  • The algorithm is simple: zip (or in this case, a more efficient izip) the two lists, then compare them element by element.
  • The eumerate function gives the index position which we can return if a discrepancy found
  • If we exit the for loop without any returns, one of the two possibilities can happen:

    1. The two lists are identical. In this case, we want to return the length of either lists.
    2. Lists are of different length and they are equal up to the length of the shorter list. In this case, we want to return the length of the shorter list

    In ether cases, the min(...) expression is what we want.

  • This function has a bug: if you compare two empty lists, it returns 0, which seems wrong. I'll leave it to you to fix it as an exercise.
like image 42
Hai Vu Avatar answered Sep 29 '22 15:09

Hai Vu