Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is substring searching using 'in' operator, faster than using KMP algorithm?

I have come across the KMP algorithm for substring searching and implemented it in python. Later, I found that the in operator can also be used for this problem and I decided to compare their performances. To my surprise, in was much faster than the KMP algorithm and I decided to take a closer look at in.

I found that in implements __contains__ method in a string which is for containment check as suggested in Datamodel doc. But couldn't get further as to why it's faster.

Here's my implementation of KMP algorithm:

def lps(pattern):
    start_ind = 0
    lps_list = [0]
    for j in pattern[1:]:
        if(j == pattern[start_ind]):
            lps_list.append(start_ind)
            start_ind += 1
        else:
            start_ind = 0
            lps_list.append(start_ind)
    return lps_list
def kmp(search, pattern):
    lps_list = lps(pattern)
    pat_ind = 0
    for j in search:
        if(pat_ind == len(pattern)):
            return True
        if(j == pattern[pat_ind]):
            pat_ind += 1
            continue
        while(j != pattern[pat_ind] and pat_ind != 0):
            pat_ind = lps_list[pat_ind - 1]
        if(pat_ind == 0 and j == pattern[pat_ind]):
                pat_ind += 1
    else:
        if(pat_ind == len(pattern)):
            return True
        return False

The Driver block:

start = timeit.default_timer()
print('Found!!') if(kmp(search, pattern)) else print('Nope')
print(f'KMP algorithm: {(timeit.default_timer() - start):0.8e}')

start = timeit.default_timer()
print('Found!!') if(pattern in search) else print('Nope')
print(f'in operator: {(timeit.default_timer() - start):0.8e}')

Test case and result:

search = ''.join(['a' for _ in range(10000)] + ['b'])
pattern = ''.join(['a' for _ in range(1000)] + ['b'])
Found!!
KMP algorithm: 4.42536000e-03
Found!!
in operator: 3.72060003e-05

I expected the KMP algorithm to be not as slow as the results suggest since it is a decent substring searching algorithm. I couldn't understand if it is, me missing something with my test cases or due to Python's way of storing strings.

like image 946
Hyouin Kyouma Avatar asked Apr 06 '19 06:04

Hyouin Kyouma


1 Answers

As suggested in the comments to the question, the internal string searching algorithm is faster in the best cases but also slower in the worst cases than KMP algorithm. The worst cases being rarer. Here's some background, and the source code of its implementation. And the fact that it's implemented in C makes it faster. Below is a worst case for str.__contains__ and result:

search = 'a' * 10000000 + 'ba'
pattern = 'a' * 10000 + 'ba'
Found!!
KMP algorithm: 3.29188600e+00
Found!!
in operator: 3.85637655e+01
like image 167
2 revs Avatar answered Nov 03 '22 10:11

2 revs