How to create substrings efficiently

Tags:

Given a string, typically a sentence, I want to extract all substrings of lengths 3, 4, 5, 6. How can I achieve this efficiently using only Python's standard library? Here is my approach, I am looking for one which is faster. To me it seems the three outer loops are inevitable either way, but maybe there is a low-level optimized solution with itertools or so.

import time

def naive(test_sentence, start, end):
    grams = []
    for word in test_sentence:
        for size in range(start, end):
            for i in range(len(word)):
                k = word[i:i+size]
                if len(k)==size:
                    grams.append(k)
    return grams

n = 10**6
start, end = 3, 7
test_sentence = "Hi this is a wonderful test sentence".split(" ")

start_time = time.time()
for _ in range(n):
    naive(test_sentence, start, end)
end_time = time.time()

print(f"{end-start} seconds for naive approach")

Output of naive():

['thi', 'his', 'this', 'won', 'ond', 'nde', 'der', 'erf', 'rfu', 'ful', 'wond', 'onde', 'nder', 'derf', 'erfu', 'rful', 'wonde', 'onder', 'nderf', 'derfu', 'erful', 'wonder', 'onderf', 'nderfu', 'derful', 'tes', 'est', 'test', 'sen', 'ent', 'nte', 'ten', 'enc', 'nce', 'sent', 'ente', 'nten', 'tenc', 'ence', 'sente', 'enten', 'ntenc', 'tence', 'senten', 'entenc', 'ntence']

Second version:

def naive2(test_sentence,start,end):
    grams = []
    for word in test_sentence:
        if len(word) >= start:
            for size in range(start,end):
                for i in range(len(word)-size+1):
                    grams.append(word[i:i+size])
    return grams

562

asked Dec 09 '21 20:12

Marcel Braasch

1 Answers

Well, I think this is not possible to improve the algorithm, but you can micro-optimize the function:

def naive3(test_sentence,start,end):
    rng = range(start,end)
    return [word[i:i+size] for word in test_sentence
                           if len(word) >= start
                           for size in rng
                           for i in range(len(word)+1-size)]

Python 3.8 introduces assignment Expressions that are quite useful for performance. Thus if you can use a recent version, then you can write:

def naive4(test_sentence,start,end):
    rng = range(start,end)
    return [word[i:i+size] for word in test_sentence 
                           if (lenWord := len(word)+1) > start
                           for size in rng
                           for i in range(lenWord-size)]

Here are performance results:

naive2: 8.28 µs ±  55 ns per call
naive3: 7.28 µs ± 124 ns per call
naive4: 6.86 µs ±  48 ns per call    (20% faster than naive2)

Note that half of the time of naive4 is spent in creating the word[i:i+size] string objects and the rest is mainly spent in the CPython interpreter (mainly due to the creation/reference-counting/deletion of variable-sized integer objects).

183

answered Oct 12 '22 03:10

Jérôme Richard

Related questions
                            
                                log4j temporary fix in elasticsearch helm chart using Dlog4j2.formatMsgNoLookups
                            
                                How to avoid Log4J exploit? [closed]
                            
                                Vue: Best approach to access possible existing nested properties in error object
                            
                                Hot to figure out which factor level has been mapped to which fill color on a barplot in R?
                            
                                404 error while adding lambda trigger in cognito user pool
                            
                                THREE.JS & Reality Capture - Rotation issue photogrammetry reference camera's in a 3D space
                            
                                How to create a company specific parent dependency in gradle
                            
                                JavaScript - Empty an Array / Object while Passing by Reference Issue
                            
                                How to pass particular property to all child components in react?
                            
                                How to get the image links from Gutenberg block gallery and add them as html data attributes to a button in wordpress
                            
                                Python generic type that implements protocol
                            
                                Compiler warning for statement on same line as #endif

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to create substrings efficiently

Tags:

performance

python

Marcel Braasch

People also ask

1 Answers

Jérôme Richard

Recent Activity

Donate For Us