Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to create substrings efficiently

Given a string, typically a sentence, I want to extract all substrings of lengths 3, 4, 5, 6. How can I achieve this efficiently using only Python's standard library? Here is my approach, I am looking for one which is faster. To me it seems the three outer loops are inevitable either way, but maybe there is a low-level optimized solution with itertools or so.

import time

def naive(test_sentence, start, end):
    grams = []
    for word in test_sentence:
        for size in range(start, end):
            for i in range(len(word)):
                k = word[i:i+size]
                if len(k)==size:
                    grams.append(k)
    return grams

n = 10**6
start, end = 3, 7
test_sentence = "Hi this is a wonderful test sentence".split(" ")

start_time = time.time()
for _ in range(n):
    naive(test_sentence, start, end)
end_time = time.time()

print(f"{end-start} seconds for naive approach")

Output of naive():

['thi', 'his', 'this', 'won', 'ond', 'nde', 'der', 'erf', 'rfu', 'ful', 'wond', 'onde', 'nder', 'derf', 'erfu', 'rful', 'wonde', 'onder', 'nderf', 'derfu', 'erful', 'wonder', 'onderf', 'nderfu', 'derful', 'tes', 'est', 'test', 'sen', 'ent', 'nte', 'ten', 'enc', 'nce', 'sent', 'ente', 'nten', 'tenc', 'ence', 'sente', 'enten', 'ntenc', 'tence', 'senten', 'entenc', 'ntence']

Second version:

def naive2(test_sentence,start,end):
    grams = []
    for word in test_sentence:
        if len(word) >= start:
            for size in range(start,end):
                for i in range(len(word)-size+1):
                    grams.append(word[i:i+size])
    return grams
like image 562
Marcel Braasch Avatar asked Dec 09 '21 20:12

Marcel Braasch


People also ask

How do you get all substrings of a string in O N?

Finding all substrings of a string is O(n2) (by finding a substring I mean determining its begin and end indexes), it's easy to see because the total number of substrings is O(n2). But printing all of them out is O(n3), simply because total number of characters to be printed is O(n3).

How many Substrings are in a string of length n?

Approach: The count of sub-strings of length n will always be len – n + 1 where len is the length of the given string.

How do you find all substrings in length K?

If the length of a string is N, then there can be N – K + 1 substring of length K. Generating these substrings will require O(N) complexity, and checking each substring requires O(K) complexity, hence making the overall complexity like O(N*K).


1 Answers

Well, I think this is not possible to improve the algorithm, but you can micro-optimize the function:

def naive3(test_sentence,start,end):
    rng = range(start,end)
    return [word[i:i+size] for word in test_sentence
                           if len(word) >= start
                           for size in rng
                           for i in range(len(word)+1-size)]

Python 3.8 introduces assignment Expressions that are quite useful for performance. Thus if you can use a recent version, then you can write:

def naive4(test_sentence,start,end):
    rng = range(start,end)
    return [word[i:i+size] for word in test_sentence 
                           if (lenWord := len(word)+1) > start
                           for size in rng
                           for i in range(lenWord-size)]

Here are performance results:

naive2: 8.28 µs ±  55 ns per call
naive3: 7.28 µs ± 124 ns per call
naive4: 6.86 µs ±  48 ns per call    (20% faster than naive2)

Note that half of the time of naive4 is spent in creating the word[i:i+size] string objects and the rest is mainly spent in the CPython interpreter (mainly due to the creation/reference-counting/deletion of variable-sized integer objects).

like image 183
Jérôme Richard Avatar answered Oct 12 '22 03:10

Jérôme Richard