I am looking for the fastest way to output the index of the first difference of two arrays in Python. For example, let's take the following two arrays:
test1 = [1, 3, 5, 8]
test2 = [1]
test3 = [1, 3]
Comparing test1
and test2
, I would like to output 1
, while the comparison of test1 and test3 should output 2
.
In other words I look for an equivalent to the statement:
import numpy as np
np.where(np.where(test1 == test2, test1, 0) == '0')[0][0]
with varying array lengths.
Any help is appreciated.
To compare two arrays and return the element-wise minimum, use the numpy. fmin() method in Python Numpy. Return value is either True or False. Compare two arrays and returns a new array containing the element-wise maxima.
import numpy as np A = np. array([[1, 1], [2, 2]]) B = np. array([[1, 1], [2, 2]]) print(A == B) In this resulting matrix, each element is a result of a comparison of two corresponding elements in the two arrays.
equals() Method. Java Arrays class provides the equals() method to compare two arrays. It iterates over each value of an array and compares the elements using the equals() method.
For lists this works:
from itertools import zip_longest
def find_first_diff(list1, list2):
for index, (x, y) in enumerate(zip_longest(list1, list2,
fillvalue=object())):
if x != y:
return index
zip_longest
pads the shorter list with None
or with a provided fill value. The standard zip
does not work if the difference is caused by different list lengths rather than actual different values in the lists.
On Python 2 use izip_longest
.
Updated: Created unique fill value to avoid potential problems with None
as list value. object()
is unique:
>>> o1 = object()
>>> o2 = object()
>>> o1 == o2
False
This pure Python approach might be faster than a NumPy solution. This depends on the actual data and other circumstances.
Converting a list into a NumPy array also takes time. This might actually take longer than finding the index with the function above. If you are not going to use the NumPy array for other calculations, the conversion might cause considerable overhead.
NumPy always searches the full array. If the difference comes early, you do a lot more work than you need to.
NumPy creates a bunch of intermediate arrays. This costs memory and time.
NumPy needs to construct intermediate arrays with the maximum length. Comparing many small with very large arrays is unfavorable here.
In general, in many cases NumPy is faster than a pure Python solution. But each case is a bit different and there are situations where pure Python is faster.
with numpy arrays (which will be faster for big arrays) then you could check the lengths of the lists then (also) check the overlapping parts something like the following (obviously slicing the longer to the length of the shorter):
import numpy as np
n = min(len(test1), len(test2))
x = np.where(test1[:n] != test2[:n])[0]
if len(x) > 0:
ans = x[0]
elif len(test1) != len(test2):
ans = n
else:
ans = None
EDIT - despite this being voted down I will leave my answer up here in case someone else needs to do something similar.
If the starting arrays are large and numpy then this is the fastest method. Also I had to modify Andy's code to get it to work. In the order: 1. my suggestion, 2. Paidric's (now removed but the most elegant), 3. Andy's accepted answer, 4. zip - non numpy, 5. vanilla python without zip as per @leekaiinthesky
0.1ms, 9.6ms, 0.6ms, 2.8ms, 2.3ms
if the conversion to ndarray is included in timeit then the non-numpy nop-zip method is fastest
7.1ms, 17.1ms, 7.7ms, 2.8ms, 2.3ms
and even more so if the difference between the two lists is at around index 1,000 rather than 10,000
7.1ms, 17.1ms, 7.7ms, 0.3ms, 0.2ms
import timeit
setup = """
import numpy as np
from itertools import zip_longest
list1 = [1 for i in range(10000)] + [4, 5, 7]
list2 = [1 for i in range(10000)] + [4, 4]
test1 = np.array(list1)
test2 = np.array(list2)
def find_first_diff(l1, l2):
for index, (x, y) in enumerate(zip_longest(l1, l2, fillvalue=object())):
if x != y:
return index
def findFirstDifference(list1, list2):
minLength = min(len(list1), len(list2))
for index in range(minLength):
if list1[index] != list2[index]:
return index
return minLength
"""
fn = ["""
n = min(len(test1), len(test2))
x = np.where(test1[:n] != test2[:n])[0]
if len(x) > 0:
ans = x[0]
elif len(test1) != len(test2):
ans = n
else:
ans = None""",
"""
x = np.where(np.in1d(list1, list2) == False)[0]
if len(x) > 0:
ans = x[0]
else:
ans = None""",
"""
x = test1
y = np.resize(test2, x.shape)
x = np.where(np.where(x == y, x, 0) == 0)[0]
if len(x) > 0:
ans = x[0]
else:
ans = None""",
"""
ans = find_first_diff(list1, list2)""",
"""
ans = findFirstDifference(list1, list2)"""]
for f in fn:
print(timeit.timeit(f, setup, number = 1000))
Here one way to do it:
from itertools import izip
def compare_lists(lista, listb):
"""
Compare two lists and return the first index where they differ. if
they are equal, return the list len
"""
for position, (a, b) in enumerate(zip(lista, listb)):
if a != b:
return position
return min([len(lista), len(listb)])
zip
(or in this case, a more efficient izip
) the two lists, then compare them element by element.eumerate
function gives the index position which we can return if a discrepancy foundIf we exit the for
loop without any returns, one of the two possibilities can happen:
In ether cases, the min(...)
expression is what we want.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With