Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Interleave different length lists, elimating duplicates, and preserve order

I have two lists, let's say:

keys1 = ['A', 'B', 'C', 'D', 'E',           'H', 'I'] keys2 = ['A', 'B',           'E', 'F', 'G', 'H',      'J', 'K'] 

How do I create a merged list without duplicates that preserve the order of both lists, inserting the missing elements where they belong? Like so:

merged = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K'] 

Note that the elements can be compared against equality but not ordered (they are complex strings). The elements can't be ordered by comparing them, but they have an order based on their occurrence in the original lists.

In case of contradiction (different order in both input lists), any output containing all elements is valid. Of course with bonus points if the solution shows 'common sense' in preserving most of the order.

Again (as some comments still argue about it), the lists normally don't contradict each other in terms of the order of the common elements. In case they do, the algorithm needs to handle that error gracefully.

I started with a version that iterates over the lists with .next() to advance just the list containing the unmatched elements, but .next() just doesn't know when to stop.

merged = [] L = iter(keys1) H = iter(keys2) l = L.next() h = H.next()  for i in range(max(len(keys1, keys2))):   if l == h:     if l not in merged:       merged.append(l)     l = L.next()     h = H.next()    elif l not in keys2:     if l not in merged:       merged.append(l)     l = L.next()    elif h not in keys1:     if h not in merged:       merged.append(h)     h = H.next()    else: # just in case the input is badly ordered     if l not in merged:       merged.append(l)     l = L.next()     if h not in merged:       merged.append(h)     h = H.next()     print merged 

This obviously doesn't work, as .next() will cause an exception for the shortest list. Now I could update my code to catch that exception every time I call .next(). But the code already is quite un-pythonic and this would clearly burst the bubble.

Does anyone have a better idea of how to iterate over those lists to combine the elements?

Bonus points if I can do it for three lists in one go.

like image 873
Chaos_99 Avatar asked Jan 09 '13 16:01

Chaos_99


2 Answers

What you need is basically what any merge utility does: It tries to merge two sequences, while keeping the relative order of each sequence. You can use Python's difflib module to diff the two sequences, and merge them:

from difflib import SequenceMatcher  def merge_sequences(seq1,seq2):     sm=SequenceMatcher(a=seq1,b=seq2)     res = []     for (op, start1, end1, start2, end2) in sm.get_opcodes():         if op == 'equal' or op=='delete':             #This range appears in both sequences, or only in the first one.             res += seq1[start1:end1]         elif op == 'insert':             #This range appears in only the second sequence.             res += seq2[start2:end2]         elif op == 'replace':             #There are different ranges in each sequence - add both.             res += seq1[start1:end1]             res += seq2[start2:end2]     return res 

Example:

>>> keys1 = ['A', 'B', 'C', 'D', 'E',           'H', 'I'] >>> keys2 = ['A', 'B',           'E', 'F', 'G', 'H',      'J', 'K'] >>> merge_sequences(keys1, keys2) ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K'] 

Note that the answer you expect is not necessarily the only possible one. For example, if we change the order of sequences here, we get another answer which is just as valid:

>>> merge_sequences(keys2, keys1) ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'I'] 
like image 184
interjay Avatar answered Sep 22 '22 16:09

interjay


I suspect that you may be asking for a solution to the shortest common supersequence problem, which I believe is NP-hard in the general case of an arbitrary number of input sequences. I'm not aware of any libraries for solving this problem, so you might have to implement one by hand. Probably the quickest way to get to working code would be to take interjay's answer using difflib and then use reduce to run it on an arbitrary number of lists (make sure to specify the empty list as the 3rd argument to reduce).

like image 43
Ryan C. Thompson Avatar answered Sep 21 '22 16:09

Ryan C. Thompson