Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the fastest way to check if a substring is in a string as an entire word or term, like RegEx with boundaries?

I have a problem to find the fastest way to check if a substring is in a string as an entire word or term. Currently, I'm using RegEx, but I need to perform thousands of verifications and RegEx is being VERY slow.

There are many ways to respond to this. The easier way to verify is substring in string:

substring = "programming"
string = "Python is a high-level programming language"

substring in string

>>> True

In other hand, it's a naivy solution when we need to find the substring as an entire word or term:

substring = "program"
string = "Python is a high-level programming language"

substring in string

>>> True

Another solution is to split the string into a list of words and verify if the substring is in that list:

substring = "program"
string = "Python is a high-level programming language"

substring in string.split()

>>> False

Nevertheless, it doesn't work if the substring is a term. To resolve this, another solution would be to use RegEx:

import re

substring = "high-level program"
string = "Python is a high-level programming language"

re.search(r"\b{}\b".format(substring), string) != None

>>> False

However, my biggest problem is that the solution is REALLY slow if you need to perform thousands of verifications.

To mitigate this issue, I created some approaches that, although they are faster than RegEx (for the use I need), still are a lot slower than substring in string:

substring = "high-level program"
string = "Python is a high-level programming language"

all([word in string.split() for word in substring.split()])

>>> False

Although simple, the above approach didn't fit because it ignores substring word order, returning True if the substring was "programming high-level", unlike the solution in RegEx. So, I created another approach verifying if the substring is in a ngram list where each ngram has the same number of words as the substring:

from nltk import ngrams

substring = "high-level program"
string = "Python is a high-level programming language"

ngram = list(ngrams(string.split(), len(substring.split())))

substring in [" ".join(tuples) for tuples in ngram]

>>> False

EDIT: Here is a less slow version, working with the same principle, but using only built-in functions:

substring = "high-level program"
string = "Python is a high-level programming language"

length = len(substring.split())
words = string.split()
ngrams = [" ".join(words[i:i+length]) for i in range(len(words) - length)]

substring in ngrams

>>> False

Someone knows some a faster approach to find a substring inside a string as an entire word or term?

like image 347
Eduardo Ramon Resser Avatar asked Oct 17 '25 06:10

Eduardo Ramon Resser


1 Answers

Simply loop through the string and splice the string according to the substring length and compare the splice string with the substring if it is equal return True.

Illustration*

strs = "Coding"
substr = "ding"
slen = 4
i = 0

check = strs[i:slen+i]==substr

# 1st iteration
strs[0:4+0] == ding
codi == ding # False

# 2nd iteration
i=1
strs[1:4+1] == ding
odin == ding # False

# 3rd iteration
i=2
strs [2:4+2] == ding
ding == ding # True

Solution

def str_exist(string, substring, slen):
    for i in range(len(string)):
        if string[i:slen+i] == substring:
             return True
    return False

substring = "high-level program"
string = "Python is a high-level programming language"
slen = len(substring)

print(str_exist(string, substring, slen))

OUTPUT

True
like image 67
Maxwell D. Dorliea Avatar answered Oct 19 '25 20:10

Maxwell D. Dorliea



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!