I need to find if items from a list appear in a string, and then add the items to a different list. This code works:
data = []
line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**Thing1**aoufgyafkugafkjhafkjhflahfklh**Thing2**dlfkhalfhafli...'
_legal = ['thing1', 'thing2', 'thing3', 'thing4',...]
for i in _legal:
    if i in line:
        data.append(i)
However, the code iterates over line (which could be long) multiple times: as many times as there are items in _legal (which could be a lot). That's too slow for me, and I'm searching for a way to do it faster. line doesn't have any specific format, so using .split() wouldn't work, as far as I know.
Edit: changed line so that it better represents the problem.
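One way I could think of to improve this is:
1. Get all the unique lengths of the items in _legal.
2. Build a dictionary (a set, in the code that follows) of the substrings of line of those particular lengths using a sliding window technique. The complexity should be O( len(line)*num_of_unique_lengths ), which should be better than brute force.
3. Now look up each thing in the dictionary in O(1).
Code: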
line = 'thing1 thing2 456 xxualt542l lthin. dfjladjfj lauthina '
_legal = ['thing1', 'thing2', 'thing3', 'thing4', 't5', '5', 'fj la']
# unique lengths of the items we are looking for
ul = {len(i) for i in _legal}
s = set()
for l in ul:
    # every substring of line with length l; the +1 keeps the final window
    s = s.union({line[i:i+l] for i in range(len(line) - l + 1)})
print(s.intersection(set(_legal)))
Output:
{'thing1', 'fj la', 'thing2', 't5', '5'}
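The same idea can be wrapped in a function. This is only a sketch: the name find_legal_items is made up for illustration, and it assumes matching should ignore case (the question's line contains 'Thing1' while _legal contains 'thing1'), which the snippet above does not do.

```python
def find_legal_items(line, legal):
    """Return the items of `legal` that occur in `line`, ignoring case."""
    haystack = line.lower()
    # Unique lengths of the non-empty items we are looking for.
    lengths = {len(item) for item in legal if item}
    windows = set()
    for n in lengths:
        # Every substring of length n; an empty range if n > len(haystack).
        windows.update(haystack[i:i + n] for i in range(len(haystack) - n + 1))
    # Keep the order of `legal`; each membership test is a set lookup.
    return [item for item in legal if item.lower() in windows]


line = 'xx**Thing1**yy**Thing2**zz'
legal = ['thing1', 'thing2', 'thing3', 'thing4']
print(find_legal_items(line, legal))  # ['thing1', 'thing2']
```

Note that the set of windows can hold up to roughly len(line) substrings per unique length, so this trades memory for speed. It probably pays off only when _legal has many items but few distinct lengths.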
One approach is to build a very simple regex pattern and use re.findall() to find/extract any matched words in the string.
import re
line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**Thing1**aoufgyafkugafkjhafkjhflahfklh**Thing2**dlfkhalfhafli...'
_legal = ['thing1', 'thing2', 'thing3', 'thing4']
# re.escape keeps items containing regex metacharacters (., *, + ...) literal
exp = re.compile('|'.join(map(re.escape, _legal)), re.IGNORECASE)
exp.findall(line)
>>> ['Thing1', 'Thing2']
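Since findall() returns the text as it appears in line (and repeats duplicates), a small wrapper can map the matches back to the entries of _legal. The following is a sketch under some assumptions: find_legal_regex is a hypothetical helper name, items are escaped with re.escape, and longer items are tried first so that a longer item is not shadowed by a shorter one that is its prefix.

```python
import re


def find_legal_regex(line, legal):
    """Return the items of `legal` matched in `line`, ignoring case."""
    # Longest alternatives first: the regex engine takes the first
    # alternative that matches at a given position.
    alternatives = sorted(legal, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(a) for a in alternatives),
                         re.IGNORECASE)
    # Normalise matched text to lower case and drop duplicates.
    found = {m.group(0).lower() for m in pattern.finditer(line)}
    return [item for item in legal if item.lower() in found]


line = 'xx**Thing1**yy**Thing2**zz'
legal = ['thing1', 'thing2', 'thing3', 'thing4']
print(find_legal_regex(line, legal))  # ['thing1', 'thing2']
```

One caveat: regex matches do not overlap, so if two items overlap inside line, only the one that matches first is reported. If that case matters, the sliding-window approach above does not have this limitation.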