I have come across the KMP algorithm for substring searching and implemented it in python. Later, I found that the in
operator can also be used for this problem and I decided to compare their performances. To my surprise, in
was much faster than the KMP algorithm and I decided to take a closer look at in
.
I found that in
implements __contains__
method in a string which is for containment check as suggested in Datamodel doc. But couldn't get further as to why it's faster.
Here's my implementation of KMP algorithm:
def lps(pattern):
start_ind = 0
lps_list = [0]
for j in pattern[1:]:
if(j == pattern[start_ind]):
lps_list.append(start_ind)
start_ind += 1
else:
start_ind = 0
lps_list.append(start_ind)
return lps_list
def kmp(search, pattern):
lps_list = lps(pattern)
pat_ind = 0
for j in search:
if(pat_ind == len(pattern)):
return True
if(j == pattern[pat_ind]):
pat_ind += 1
continue
while(j != pattern[pat_ind] and pat_ind != 0):
pat_ind = lps_list[pat_ind - 1]
if(pat_ind == 0 and j == pattern[pat_ind]):
pat_ind += 1
else:
if(pat_ind == len(pattern)):
return True
return False
The Driver block:
start = timeit.default_timer()
print('Found!!') if(kmp(search, pattern)) else print('Nope')
print(f'KMP algorithm: {(timeit.default_timer() - start):0.8e}')
start = timeit.default_timer()
print('Found!!') if(pattern in search) else print('Nope')
print(f'in operator: {(timeit.default_timer() - start):0.8e}')
Test case and result:
search = ''.join(['a' for _ in range(10000)] + ['b'])
pattern = ''.join(['a' for _ in range(1000)] + ['b'])
Found!!
KMP algorithm: 4.42536000e-03
Found!!
in operator: 3.72060003e-05
I expected the KMP algorithm to be not as slow as the results suggest since it is a decent substring searching algorithm. I couldn't understand if it is, me missing something with my test cases or due to Python's way of storing strings.
As suggested in the comments to the question, the internal string searching algorithm is faster in the best cases but also slower in the worst cases than KMP algorithm. The worst cases being rarer. Here's some background, and the source code of its implementation. And the fact that it's implemented in C makes it faster.
Below is a worst case for str.__contains__
and result:
search = 'a' * 10000000 + 'ba'
pattern = 'a' * 10000 + 'ba'
Found!!
KMP algorithm: 3.29188600e+00
Found!!
in operator: 3.85637655e+01
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With