Python: Identifying and deleting duplicate sequences in list

Question

I'm looking for the best way to find duplicate sequences in a list. A sequence is defined as at least two neighbouring values.

Example: In the following list, the repeating sequence should be identified and deleted.

a = [45874,
    35195, # <-
    28965,
    05867,
    25847, # <-
    94937,
    64894,
    55535,
    62899,
    00391,
    35195, # Duplicate Sequence Start
    28965,
    05867,
    25847, # Duplicate Sequence End
    08483,
    55801,
    33129,
    42616]

I can't wrap my head around any solution, so any help is much appreciated!

Olivier Melançon · Accepted Answer

Finding a single subsequence of length m in a string of length n can be done in no less than O(nm) with Boyer-Moore algorithm. Finding all duplicate subsequences will most likely be more than that. Although, if all you want is to delete the subsequences and not find them, there is a trick.

Deleting duplicate subsequences in O(n)

We only need to concern ourselves with subsequences of length 2, because any sequence can be expressed as overlapping subsequences of length 2. This observation allows us to keep a count of such pairs and then remove them from right to left.

In particular, this requires traversing the list only a fixed amount of times and is thus O(n).

from collections import Counter

def remove_duplicates(lst):
    dropped_indices = set()
    counter = Counter(tuple(lst[i:i+2]) for i in range(len(lst) - 1))

    for i in range(len(lst) - 2, -1, -1):
        sub = tuple(lst[i:i+2])
        if counter[sub] > 1:
            dropped_indices |= {i, i + 1}
            counter[sub] -= 1

    return [x for i, x in enumerate(lst) if i not in dropped_indices]

Here is an example based on the provided input.

a = [45874,
     35195, # This
     28965, # Is
     5867,  # A
     25847, # Subsequence
     94937,
     64894,
     55535,
     62899,
     391,
     35195, # Those
     28965, # Should
     5867,  # Be
     25847, # Removed
     8483]

b = remove_duplicates(a)
# Output:
#   [45874,
#    35195,
#    28965,
#    5867,
#    25847,
#    94937,
#    64894,
#    55535,
#    62899,
#    391,
#            <- Here the duplicate was removed
#    8483]

Python: Identifying and deleting duplicate sequences in list

Tags:

python

list

sequences

subsequence

Vingtoft

1 Answers

Deleting duplicate subsequences in O(n)

Olivier Melançon

Recent Activity

Donate For Us

Python: Identifying and deleting duplicate sequences in list

Tags:

python

list

sequences

subsequence

Vingtoft

1 Answers

Deleting duplicate subsequences in O(n)

Olivier Melançon

Related questions

Recent Activity

Donate For Us