I have a problem to find the fastest way to check if a substring is in a string as an entire word or term. Currently, I'm using RegEx, but I need to perform thousands of verifications and RegEx is being VERY slow.
There are many ways to respond to this. The easier way to verify is substring in string
:
substring = "programming"
string = "Python is a high-level programming language"
substring in string
>>> True
In other hand, it's a naivy solution when we need to find the substring as an entire word or term:
substring = "program"
string = "Python is a high-level programming language"
substring in string
>>> True
Another solution is to split the string into a list of words and verify if the substring is in that list:
substring = "program"
string = "Python is a high-level programming language"
substring in string.split()
>>> False
Nevertheless, it doesn't work if the substring is a term. To resolve this, another solution would be to use RegEx:
import re
substring = "high-level program"
string = "Python is a high-level programming language"
re.search(r"\b{}\b".format(substring), string) != None
>>> False
However, my biggest problem is that the solution is REALLY slow if you need to perform thousands of verifications.
To mitigate this issue, I created some approaches that, although they are faster than RegEx (for the use I need), still are a lot slower than substring in string
:
substring = "high-level program"
string = "Python is a high-level programming language"
all([word in string.split() for word in substring.split()])
>>> False
Although simple, the above approach didn't fit because it ignores substring word order, returning True
if the substring was "programming high-level"
, unlike the solution in RegEx. So, I created another approach verifying if the substring is in a ngram list where each ngram has the same number of words as the substring:
from nltk import ngrams
substring = "high-level program"
string = "Python is a high-level programming language"
ngram = list(ngrams(string.split(), len(substring.split())))
substring in [" ".join(tuples) for tuples in ngram]
>>> False
EDIT: Here is a less slow version, working with the same principle, but using only built-in functions:
substring = "high-level program"
string = "Python is a high-level programming language"
length = len(substring.split())
words = string.split()
ngrams = [" ".join(words[i:i+length]) for i in range(len(words) - length)]
substring in ngrams
>>> False
Someone knows some a faster approach to find a substring inside a string as an entire word or term?
Simply loop through the string and splice the string according to the substring length and compare the splice string with the substring if it is equal return True.
Illustration*
strs = "Coding"
substr = "ding"
slen = 4
i = 0
check = strs[i:slen+i]==substr
# 1st iteration
strs[0:4+0] == ding
codi == ding # False
# 2nd iteration
i=1
strs[1:4+1] == ding
odin == ding # False
# 3rd iteration
i=2
strs [2:4+2] == ding
ding == ding # True
Solution
def str_exist(string, substring, slen):
for i in range(len(string)):
if string[i:slen+i] == substring:
return True
return False
substring = "high-level program"
string = "Python is a high-level programming language"
slen = len(substring)
print(str_exist(string, substring, slen))
OUTPUT
True
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With