I'm trying to generate a list of all overlapping n-length substrings in a given string.
For example, for an n of 6
and the string "hereismystring"
I would generate the list ["hereis", "ereism", "reismy", ..., "string"]
. The trivial code I'm using right now looks like this:
n = 6
l = len(string)
substrings = [string[i:(i + n)] for i in xrange(l - n + 1)]
Easy enough. Problem is, I'd like to speed this up (I have very many very long strings). Is there a faster technique in Python? Will dropping down to Cython help at all given that Python's string routines are in C anyhow?
For reference, this technique takes about 100us on my machine (a new Macbook Pro) for a 500-length string and an n of 30.
Thanks for the help in advance!
Taking a step back from the issue of which Python coding technique will be the fastest, I would approach the problem differently. Since all the strings are the same length, and all come from a single source string, why not simply work with the ranges of characters directly, rather than convert them into proper strings? You would avoid a lot of allocation and copying, but you would have to adjust your code to know that each "string" is n characters long.
In other words, just read ranges from the source string directly when you want to work with a substring. You'll be working with the characters you want as fast as they can be pulled from cache. You could express a "substring" as merely an offset into the source string.
Sometimes if you want ultra-fast performance you have to leave familiar data structures behind. Just a thought.
How about:
>>> d = deque("hereismystring")
>>> s = ''.join(d)[:6]
>>> while not len(s) % 6:
... print s
... _ = d.popleft()
... s = ''.join(d)[:6]
...
hereis
ereism
reismy
eismys
ismyst
smystr
mystri
ystrin
string
>>>
I believe deque is O(1) while lists are O(n)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With