What's the most efficient method to remove a list of substrings from a string?
I'd like a cleaner, quicker way to do the following:
words = 'word1 word2 word3 word4, word5'
replace_list = ['word1', 'word3', 'word5']
def remove_multiple_strings(cur_string, replace_list):
    for cur_word in replace_list:
        cur_string = cur_string.replace(cur_word, '')
    return cur_string
remove_multiple_strings(words, replace_list)
To remove several substrings from a string, you can call str.replace once for each substring. Python's str class provides the method replace(old, new), which returns a copy of the string in which every occurrence of the substring old is replaced by new; passing an empty string as new removes the substring.
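As a quick illustration of the str.replace behavior described above, here is a minimal sketch. The sample string is made up for the example and is not taken from the question.

```python
text = 'word1 word2 word1'

# replace() affects every occurrence of the substring, not only the first one.
removed = text.replace('word1', '')

# Strings are immutable, so replace() returns a new string
# and leaves the original untouched.
print(repr(removed))
print(repr(text))
```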
Regex:
>>> import re
>>> re.sub(r'|'.join(map(re.escape, replace_list)), '', words)
' word2  word4, '
The above one-liner is actually not as fast as your str.replace version, but it is definitely shorter:
>>> import hashlib, random
>>> words = ' '.join([hashlib.sha1(str(random.random()).encode()).hexdigest()[:10] for _ in range(10000)])
>>> replace_list = words.split()[:1000]
>>> random.shuffle(replace_list)
>>> %timeit remove_multiple_strings(words, replace_list)
10 loops, best of 3: 49.4 ms per loop
>>> %timeit re.sub(r'|'.join(map(re.escape, replace_list)), '', words)
1 loops, best of 3: 623 ms per loop
Gosh! More than 12x slower.
But can we improve it? Yes.
As we are only concerned with words, what we can do is simply pick out each word from the words string using \w+ and compare it against a set built from replace_list (yes, an actual set: set(replace_list)):
>>> def sub(m):
...     return '' if m.group() in s else m.group()
>>> %%timeit
s = set(replace_list)
re.sub(r'\w+', sub, words)
...
100 loops, best of 3: 7.8 ms per loop
For even larger strings and word lists, the str.replace approach and my first regex solution will end up taking quadratic time, whereas the set-based \w+ solution should run in linear time.
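For reference, the set-based idea can be wrapped up as a self-contained function. This is only a sketch: the names remove_words, blocked and keep_or_drop are made up here and are not part of the answer's code. Note that, unlike str.replace, it only removes whole \w+ words, so 'word1' inside 'word10' is left alone.

```python
import re

def remove_words(text, words_to_remove):
    """Drop every whole word in text that appears in words_to_remove."""
    # A set gives constant-time membership tests on average.
    blocked = set(words_to_remove)

    def keep_or_drop(match):
        word = match.group()
        return '' if word in blocked else word

    # Each \w+ run is visited once and looked up in the set.
    return re.sub(r'\w+', keep_or_drop, text)

words = 'word1 word2 word3 word4, word5'
replace_list = ['word1', 'word3', 'word5']
result = remove_words(words, replace_list)
print(repr(result))
```

Building the set inside the function keeps the callback free of global state, at the cost of rebuilding it on every call; if the same list is reused many times, building the set once outside would be the cheaper choice.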