I have a list of many words (100,000+), and I'd like to remove every word that is a substring of another word in the list.
So for simplicity, let's imagine that I have the following list:
words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
The desired output is:
['Hello', 'Apple', 'Banana', 'Peter']
'Hell' was removed because it is a substring of 'Hello'.
'Ban' was removed because it is a substring of 'Banana'.
'P' was removed because it is a substring of 'Peter'.
'e' was removed because it is a substring of 'Hello', 'Hell', 'Apple', and so on.
What I've done
This is my code, but I am wondering if there is a more efficient way than these nested comprehensions.
to_remove = [x for x in words for y in words if x != y and x in y]
output = [x for x in words if x not in to_remove]
How can I improve the performance? Should I use regex instead?
@wim is correct.
Given an alphabet of fixed size, the following algorithm is linear in the total length of the text. If the alphabet is of unbounded size, it is O(n log n) instead. Either way it is better than O(n^2).
Create an empty suffix tree T.
Create an empty list filtered_words.
For word in words:
    if word not in T:
        Build a suffix tree S for word (using Ukkonen's algorithm)
        Merge S into T
        Append word to filtered_words
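The control flow can be sketched in plain Python. This is only an illustration, not the linear-time version: a concatenated string stands in for the suffix tree T, so the `word not in seen` test is a naive substring search. The helper name `keep_superstrings` and the longest-first sort are my additions; the sort ensures a substring is always examined after any superstring that could contain it, which the pseudocode leaves implicit.

```python
def keep_superstrings(words):
    # Process longest words first so that any word which is a substring
    # of another is seen only after its superstring has been recorded.
    kept = []
    seen = ""  # plain string standing in for the suffix tree T
    for word in sorted(set(words), key=len, reverse=True):
        if word not in seen:       # the "word not in T" test
            kept.append(word)
            seen += "\x00" + word  # separator prevents matches spanning two words
    return kept
```

Replacing the string with a real generalized suffix tree (built incrementally with Ukkonen's algorithm) is what brings the membership test down to O(len(word)).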
Build the set of all (unique) substrings first, then filter the words with it:
def substrings(s):
    # All contiguous substrings of s, excluding s itself.
    length = len(s)
    return {s[i:j + 1] for i in range(length) for j in range(i, length)} - {s}

def remove_substrings(words):
    subs = set()
    for word in words:
        subs |= substrings(word)
    return {w for w in words if w not in subs}
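For example, on the list from the question (the functions are repeated here so the snippet runs on its own):

```python
def substrings(s):
    # All contiguous substrings of s, excluding s itself.
    length = len(s)
    return {s[i:j + 1] for i in range(length) for j in range(i, length)} - {s}

def remove_substrings(words):
    subs = set()
    for word in words:
        subs |= substrings(word)
    return {w for w in words if w not in subs}

words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
print(remove_substrings(words))  # {'Hello', 'Apple', 'Banana', 'Peter'} in some order
```

Note that a word of length L has O(L^2) substrings, so this approach trades memory for simplicity: it is fast for short words but can grow large for a list of long strings.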