I am working on two large data sets, and my question is as follows.
Suppose I have two lists:
list1 = [A,B,C,D]
list2 = [B,D,A,G]
How can I efficiently find the matching index, using Python, other than O(n2) searching? The result should look like:
matching_index(list1,list2) -> [(0,2),(1,0),(3,1)]
Sort & Compare to check if two lists are equal If both are of different size then it means lists are not equal. Whereas, if both lists are of same size, then created a sorted versions of both the lists and compared them using == operator to check if lists are equal or not.
One of the most basic ways to get the index positions of all occurrences of an element in a Python list is by using a for loop and the Python enumerate function. The enumerate function is used to iterate over an object and returns both the index and element.
By using the list. index() method, we can easily get the element index value from list. In this example we have define a list of integer values and uses the list. index() method we can get the index of the item whose value is '210'.
If your objects are hashable and your lists have no duplicates, you can create an inverted index of the first list and then traverse the second list. This traverses each list only once and thus is O(n)
.
def find_matching_index(list1, list2):
inverse_index = { element: index for index, element in enumerate(list1) }
return [(index, inverse_index[element])
for index, element in enumerate(list2) if element in inverse_index]
find_matching_index([1,2,3], [3,2,1]) # [(0, 2), (1, 1), (2, 0)]
You can extend the previous solution to account for duplicates. You can keep track of multiple indices with a set
.
def find_matching_index(list1, list2):
# Create an inverse index which keys are now sets
inverse_index = {}
for index, element in enumerate(list1):
if element not in inverse_index:
inverse_index[element] = {index}
else:
inverse_index[element].add(index)
# Traverse the second list
matching_index = []
for index, element in enumerate(list2):
# We have to create one pair by element in the set of the inverse index
if element in inverse_index:
matching_index.extend([(x, index) for x in inverse_index[element]])
return matching_index
find_matching_index([1, 1, 2], [2, 2, 1]) # [(2, 0), (2, 1), (0, 2), (1, 2)]
Unfortunately, this is no longer O(n). Consider the case where you input [1, 1]
and [1, 1]
, the output is [(0, 0), (0, 1), (1, 0), (1, 1)]
. Thus by the size of the output, the worst case cannot be better than O(n^2)
.
Although, this solution is still O(n)
if there are no duplicates.
Now comes the case where your objects are not hashable, but comparable. The idea here will be to sort your lists in a way that preserves the origin index of each element. Then we can group sequences of elements that are equal to get matching indices.
Since we make heavy use of groupby
and product
in the following code, I made find_matching_index
return a generator for memory efficiency on long lists.
from itertools import groupby, product
def find_matching_index(list1, list2):
sorted_list1 = sorted((element, index) for index, element in enumerate(list1))
sorted_list2 = sorted((element, index) for index, element in enumerate(list2))
list1_groups = groupby(sorted_list1, key=lambda pair: pair[0])
list2_groups = groupby(sorted_list2, key=lambda pair: pair[0])
for element1, group1 in list1_groups:
try:
element2, group2 = next(list2_groups)
while element1 > element2:
(element2, _), group2 = next(list2_groups)
except StopIteration:
break
if element2 > element1:
continue
indices_product = product((i for _, i in group1), (i for _, i in group2), repeat=1)
yield from indices_product
# In version prior to 3.3, the above line must be
# for x in indices_product:
# yield x
list1 = [[], [1, 2], []]
list2 = [[1, 2], []]
list(find_matching_index(list1, list2)) # [(0, 1), (2, 1), (1, 0)]
It turns out that time complexity does not suffer that much. Sorting of course takes O(n log(n))
, but then groupby
provides generators that can recover all elements by traversing our lists only twice. The conclusion is that our complexity is primarly bound by the size of the output of product
. Thus giving a best case where the algorithm is O(n log(n))
and a worst case that is once again O(n^2)
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With