Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Runtime of python's if substring in string

Tags:

What is the big O of the following if statement?

if "pl" in "apple":    ... 

What is the overall big O of how python determines if the string "pl" is found in the string "apple"

or any other substring in string search.

Is this the most efficient way to test if a substring is in a string? Does it use the same algorithm as .find()?

like image 805
Liondancer Avatar asked Feb 05 '16 09:02

Liondancer


People also ask

How do you check in Python if a string contains a substring?

The in Operator It returns a Boolean (either True or False ). To check if a string contains a substring in Python using the in operator, we simply invoke it on the superstring: fullstring = "StackAbuse" substring = "tack" if substring in fullstring: print("Found!") else: print("Not found!")

What is the runtime of string slicing in Python?

Short answer: str slices, in general, copy. That means that your function that does a slice for each of your string's n suffixes is doing O(n2) work. That said, you can avoid copies if you can work with bytes -like objects using memoryview s to get zero-copy views of the original bytes data.

How do you check if a string contains a substring?

You can use contains(), indexOf() and lastIndexOf() method to check if one String contains another String in Java or not. If a String contains another String then it's known as a substring. The indexOf() method accepts a String and returns the starting position of the string if it exists, otherwise, it will return -1.

How many times does a substring appear in a string Python?

Python String count() The count() method returns the number of occurrences of a substring in the given string.


1 Answers

The time complexity is O(N) on average, O(NM) worst case (N being the length of the longer string, M, the shorter string you search for). As of Python 3.10, heuristics are used to lower the worst-case scenario to O(N + M) by switching algorithms.

The same algorithm is used for str.index(), str.find(), str.__contains__() (the in operator) and str.replace(); it is a simplification of the Boyer-Moore with ideas taken from the Boyer–Moore–Horspool and Sunday algorithms.

See the original stringlib discussion post, as well as the fastsearch.h source code; until Python 3.10, the base algorithm has not changed since introduction in Python 2.5 (apart from some low-level optimisations and corner-case fixes).

The post includes a Python-code outline of the algorithm:

def find(s, p):     # find first occurrence of p in s     n = len(s)     m = len(p)     skip = delta1(p)[p[m-1]]     i = 0     while i <= n-m:         if s[i+m-1] == p[m-1]: # (boyer-moore)             # potential match             if s[i:i+m-1] == p[:m-1]:                 return i             if s[i+m] not in p:                 i = i + m + 1 # (sunday)             else:                 i = i + skip # (horspool)         else:             # skip             if s[i+m] not in p:                 i = i + m + 1 # (sunday)             else:                 i = i + 1     return -1 # not found 

as well as speed comparisons.

In Python 3.10, the algorithm was updated to use an enhanced version of the Crochemore and Perrin's Two-Way string searching algorithm for larger problems (with p and s longer than 100 and 2100 characters, respectively, with s at least 6 times as long as p), in response to a pathological edgecase someone reported. The commit adding this change included a write-up on how the algorithm works.

The Two-way algorithm has a worst-case time complexity of O(N + M), where O(M) is a cost paid up-front to build a shift table from the s search needle. Once you have that table, this algorithm does have a best-case performance of O(N/M).

like image 73
Martijn Pieters Avatar answered Sep 21 '22 05:09

Martijn Pieters