What is the fastest way to check if a substring is in a string as an entire word or term, like RegEx with boundaries?

Question

I have a problem to find the fastest way to check if a substring is in a string as an entire word or term. Currently, I'm using RegEx, but I need to perform thousands of verifications and RegEx is being VERY slow.

There are many ways to respond to this. The easier way to verify is substring in string:

substring = "programming"
string = "Python is a high-level programming language"

substring in string

>>> True

In other hand, it's a naivy solution when we need to find the substring as an entire word or term:

substring = "program"
string = "Python is a high-level programming language"

substring in string

>>> True

Another solution is to split the string into a list of words and verify if the substring is in that list:

substring = "program"
string = "Python is a high-level programming language"

substring in string.split()

>>> False

Nevertheless, it doesn't work if the substring is a term. To resolve this, another solution would be to use RegEx:

import re

substring = "high-level program"
string = "Python is a high-level programming language"

re.search(r"\b{}\b".format(substring), string) != None

>>> False

However, my biggest problem is that the solution is REALLY slow if you need to perform thousands of verifications.

To mitigate this issue, I created some approaches that, although they are faster than RegEx (for the use I need), still are a lot slower than substring in string:

substring = "high-level program"
string = "Python is a high-level programming language"

all([word in string.split() for word in substring.split()])

>>> False

Although simple, the above approach didn't fit because it ignores substring word order, returning True if the substring was "programming high-level", unlike the solution in RegEx. So, I created another approach verifying if the substring is in a ngram list where each ngram has the same number of words as the substring:

from nltk import ngrams

substring = "high-level program"
string = "Python is a high-level programming language"

ngram = list(ngrams(string.split(), len(substring.split())))

substring in [" ".join(tuples) for tuples in ngram]

>>> False

EDIT: Here is a less slow version, working with the same principle, but using only built-in functions:

substring = "high-level program"
string = "Python is a high-level programming language"

length = len(substring.split())
words = string.split()
ngrams = [" ".join(words[i:i+length]) for i in range(len(words) - length)]

substring in ngrams

>>> False

Someone knows some a faster approach to find a substring inside a string as an entire word or term?

Maxwell D. Dorliea · Accepted Answer

Simply loop through the string and splice the string according to the substring length and compare the splice string with the substring if it is equal return True.

Illustration*

strs = "Coding"
substr = "ding"
slen = 4
i = 0

check = strs[i:slen+i]==substr

# 1st iteration
strs[0:4+0] == ding
codi == ding # False

# 2nd iteration
i=1
strs[1:4+1] == ding
odin == ding # False

# 3rd iteration
i=2
strs [2:4+2] == ding
ding == ding # True

Solution

def str_exist(string, substring, slen):
    for i in range(len(string)):
        if string[i:slen+i] == substring:
             return True
    return False

substring = "high-level program"
string = "Python is a high-level programming language"
slen = len(substring)

print(str_exist(string, substring, slen))

OUTPUT

True

What is the fastest way to check if a substring is in a string as an entire word or term, like RegEx with boundaries?

Tags:

python

string

substring

regex

contains

Eduardo Ramon Resser

1 Answers

Maxwell D. Dorliea

Recent Activity

Donate For Us

What is the fastest way to check if a substring is in a string as an entire word or term, like RegEx with boundaries?

Tags:

python

string

substring

regex

contains

Eduardo Ramon Resser

1 Answers

Maxwell D. Dorliea

Related questions

Recent Activity

Donate For Us