Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to create a list of overlapping substrings from a string in Python

I'm trying to generate a list of all overlapping n-length substrings in a given string.

For example, for an n of 6 and the string "hereismystring" I would generate the list ["hereis", "ereism", "reismy", ..., "string"]. The trivial code I'm using right now looks like this:

n = 6
l = len(string)
substrings = [string[i:(i + n)] for i in xrange(l - n + 1)]

Easy enough. Problem is, I'd like to speed this up (I have very many very long strings). Is there a faster technique in Python? Will dropping down to Cython help at all given that Python's string routines are in C anyhow?

For reference, this technique takes about 100us on my machine (a new Macbook Pro) for a 500-length string and an n of 30.

Thanks for the help in advance!

like image 559
Nick Avatar asked Oct 22 '22 18:10

Nick


2 Answers

Taking a step back from the issue of which Python coding technique will be the fastest, I would approach the problem differently. Since all the strings are the same length, and all come from a single source string, why not simply work with the ranges of characters directly, rather than convert them into proper strings? You would avoid a lot of allocation and copying, but you would have to adjust your code to know that each "string" is n characters long.

In other words, just read ranges from the source string directly when you want to work with a substring. You'll be working with the characters you want as fast as they can be pulled from cache. You could express a "substring" as merely an offset into the source string.

Sometimes if you want ultra-fast performance you have to leave familiar data structures behind. Just a thought.

like image 165
Randall Cook Avatar answered Oct 24 '22 12:10

Randall Cook


How about:

>>> d = deque("hereismystring")
>>> s = ''.join(d)[:6]
>>> while not len(s) % 6:
...    print s
...    _ = d.popleft()
...    s = ''.join(d)[:6]
... 
hereis
ereism
reismy
eismys
ismyst
smystr
mystri
ystrin
string
>>> 

I believe deque is O(1) while lists are O(n)

like image 43
Burhan Khalid Avatar answered Oct 24 '22 14:10

Burhan Khalid