Finding multiple substrings in a string without iterating over it multiple times

Question

I need to check whether items from a list appear in a string, and then add the ones that do to a different list. This code works:

data = []
line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**Thing1**aoufgyafkugafkjhafkjhflahfklh**Thing2**dlfkhalfhafli...'
_legal = ['thing1', 'thing2', 'thing3', 'thing4',...]
for i in _legal:
    if i in line:
        data.append(i)

However, the code iterates over line (which could be long) multiple times: once for every item in _legal (which could be a lot). That's too slow for me, and I'm looking for a faster way to do it. line doesn't have any specific format, so as far as I know .split() wouldn't work. Edit: changed line so that it better represents the problem.

SomeDude · Accepted Answer

One way I can think of to improve this:

Get all unique lengths of the words in _legal.
Build a set of all substrings of line that have one of those lengths, using a sliding window. This produces about len(line) * num_of_unique_lengths substrings (each slice also costs time proportional to its length), which should beat the brute-force loop when _legal contains many items but only a few distinct lengths.
Now look up each item of _legal in that set, which is O(1) on average per item.

Code:

line = 'thing1 thing2 456 xxualt542l lthin. dfjladjfj lauthina '
_legal = ['thing1', 'thing2', 'thing3', 'thing4', 't5', '5', 'fj la']
# Distinct lengths of the items we are looking for
ul = {len(i) for i in _legal}
s = set()
for l in ul:
    # Every substring of length l; the + 1 includes the window ending at the last character
    s.update(line[i:i+l] for i in range(len(line) - l + 1))
print(s.intersection(_legal))

Output:

{'thing1', 'fj la', 'thing2', 't5', '5'}

S3DEV · Answer

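One approach is to build a simple regex alternation from the items in _legal, and use findall() on the compiled pattern to extract every match from the string. Note that the matches keep the casing found in line, not the casing used in _legal.

import re

line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**Thing1**aoufgyafkugafkjhafkjhflahfklh**Thing2**dlfkhalfhafli...'
_legal = ['thing1', 'thing2', 'thing3', 'thing4']

# re.escape() keeps any regex metacharacters in the items literal
exp = re.compile('|'.join(map(re.escape, _legal)), re.IGNORECASE)
print(exp.findall(line))

Output:

['Thing1', 'Thing2']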

Tags: python, string, full-text-search

Yarden Gal
