Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: Identifying and deleting duplicate sequences in list

I'm looking for the best way to find duplicate sequences in a list. A sequence is defined as at least two neighbouring values.

Example: In the following list, the repeating sequence should be identified and deleted.

a = [45874,
    35195, # <-
    28965,
    05867,
    25847, # <-
    94937,
    64894,
    55535,
    62899,
    00391,
    35195, # Duplicate Sequence Start
    28965,
    05867,
    25847, # Duplicate Sequence End
    08483,
    55801,
    33129,
    42616]

I can't wrap my head around any solution, so any help is much appreciated!

like image 970
Vingtoft Avatar asked Feb 24 '26 04:02

Vingtoft


1 Answers

Finding a single subsequence of length m in a string of length n can be done in no less than O(nm) with Boyer-Moore algorithm. Finding all duplicate subsequences will most likely be more than that. Although, if all you want is to delete the subsequences and not find them, there is a trick.

Deleting duplicate subsequences in O(n)

We only need to concern ourselves with subsequences of length 2, because any sequence can be expressed as overlapping subsequences of length 2. This observation allows us to keep a count of such pairs and then remove them from right to left.

In particular, this requires traversing the list only a fixed amount of times and is thus O(n).

from collections import Counter

def remove_duplicates(lst):
    dropped_indices = set()
    counter = Counter(tuple(lst[i:i+2]) for i in range(len(lst) - 1))

    for i in range(len(lst) - 2, -1, -1):
        sub = tuple(lst[i:i+2])
        if counter[sub] > 1:
            dropped_indices |= {i, i + 1}
            counter[sub] -= 1

    return [x for i, x in enumerate(lst) if i not in dropped_indices]

Here is an example based on the provided input.

a = [45874,
     35195, # This
     28965, # Is
     5867,  # A
     25847, # Subsequence
     94937,
     64894,
     55535,
     62899,
     391,
     35195, # Those
     28965, # Should
     5867,  # Be
     25847, # Removed
     8483]

b = remove_duplicates(a)
# Output:
#   [45874,
#    35195,
#    28965,
#    5867,
#    25847,
#    94937,
#    64894,
#    55535,
#    62899,
#    391,
#            <- Here the duplicate was removed
#    8483]
like image 120
Olivier Melançon Avatar answered Feb 25 '26 18:02

Olivier Melançon