Given a string, typically a sentence, I want to extract all substrings of lengths 3, 4, 5, 6
. How can I achieve this efficiently using only Python's standard library? Here is my approach, I am looking for one which is faster. To me it seems the three outer loops are inevitable either way, but maybe there is a low-level optimized solution with itertools
or so.
import time
def naive(test_sentence, start, end):
grams = []
for word in test_sentence:
for size in range(start, end):
for i in range(len(word)):
k = word[i:i+size]
if len(k)==size:
grams.append(k)
return grams
n = 10**6
start, end = 3, 7
test_sentence = "Hi this is a wonderful test sentence".split(" ")
start_time = time.time()
for _ in range(n):
naive(test_sentence, start, end)
end_time = time.time()
print(f"{end-start} seconds for naive approach")
Output of naive()
:
['thi', 'his', 'this', 'won', 'ond', 'nde', 'der', 'erf', 'rfu', 'ful', 'wond', 'onde', 'nder', 'derf', 'erfu', 'rful', 'wonde', 'onder', 'nderf', 'derfu', 'erful', 'wonder', 'onderf', 'nderfu', 'derful', 'tes', 'est', 'test', 'sen', 'ent', 'nte', 'ten', 'enc', 'nce', 'sent', 'ente', 'nten', 'tenc', 'ence', 'sente', 'enten', 'ntenc', 'tence', 'senten', 'entenc', 'ntence']
Second version:
def naive2(test_sentence,start,end):
grams = []
for word in test_sentence:
if len(word) >= start:
for size in range(start,end):
for i in range(len(word)-size+1):
grams.append(word[i:i+size])
return grams
Finding all substrings of a string is O(n2) (by finding a substring I mean determining its begin and end indexes), it's easy to see because the total number of substrings is O(n2). But printing all of them out is O(n3), simply because total number of characters to be printed is O(n3).
Approach: The count of sub-strings of length n will always be len – n + 1 where len is the length of the given string.
If the length of a string is N, then there can be N – K + 1 substring of length K. Generating these substrings will require O(N) complexity, and checking each substring requires O(K) complexity, hence making the overall complexity like O(N*K).
Well, I think this is not possible to improve the algorithm, but you can micro-optimize the function:
def naive3(test_sentence,start,end):
rng = range(start,end)
return [word[i:i+size] for word in test_sentence
if len(word) >= start
for size in rng
for i in range(len(word)+1-size)]
Python 3.8 introduces assignment Expressions that are quite useful for performance. Thus if you can use a recent version, then you can write:
def naive4(test_sentence,start,end):
rng = range(start,end)
return [word[i:i+size] for word in test_sentence
if (lenWord := len(word)+1) > start
for size in rng
for i in range(lenWord-size)]
Here are performance results:
naive2: 8.28 µs ± 55 ns per call
naive3: 7.28 µs ± 124 ns per call
naive4: 6.86 µs ± 48 ns per call (20% faster than naive2)
Note that half of the time of naive4
is spent in creating the word[i:i+size]
string objects and the rest is mainly spent in the CPython interpreter (mainly due to the creation/reference-counting/deletion of variable-sized integer objects).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With