Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way in Python to find a 'startswith' substring in a long sorted list of strings

I've done a lot of Googling, but haven't found anything, so I'm really sorry if I'm just searching for the wrong things.

I am writing an implementation of the Ghost for MIT Introduction to Programming, assignment 5.

As part of this, I need to determine whether a string of characters is the start of any valid word. I have a list of valid words ("wordlist").

Update: I could use something that iterated through the list each time, such as Peter's simple suggestion:

def word_exists(wordlist, word_fragment):
return any(w.startswith(word_fragment) for w in wordlist)

I previously had:

wordlist = [w for w in wordlist if w.startswith(word_fragment)]

(from here) to narrow the list down to the list of valid words that start with that fragment and consider it a loss if wordlist is empty. The reason that I took this approach was that I (incorrectly, see below) thought that this would save time, as subsequent lookups would only have to search a smaller list.

It occurred to me that this is going through each item in the original wordlist (38,000-odd words) checking the start of each. This seems silly when wordlist is ordered, and the comprehension could stop once it hits something that is after the word fragment. I tried this:

newlist = []
for w in wordlist:
    if w[:len(word_fragment)] > word_fragment:
        # Take advantage of the fact that the list is sorted
        break
    if w.startswith(word_fragment):
        newlist.append(w)
return newlist

but that is about the same speed, which I thought may be because list comprehensions run as compiled code?

I then thought that more efficient again would be some form of binary search in the list to find the block of matching words. Is this the way to go, or am I missing something really obvious?

Clearly it isn't really a big deal in this case, but I'm just starting out with programming and want to do things properly.

UPDATE:

I have since tested the below suggestions with a simple test script. While Peter's binary search/bisect would clearly be better for a single run, I was interested in whether the narrowing list would win over a series of fragments. In fact, it did not:

The totals for all strings "p", "py", "pyt", "pyth", "pytho" are as follows:
In total, Peter's simple test took 0.175472736359
In total, Peter's bisect left test took 9.36985015869e-05
In total, the list comprehension took 0.0499348640442
In total, Neil G's bisect took 0.000373601913452

The overhead of creating a second list etc clearly took more time than searching the longer list. In hindsight, this was likely the best approach regardless, as the "reducing list" approach increased the time for the first run, which was the worst case scenario.

Thanks all for some excellent suggestions, and well done Peter for the best answer!!!

like image 738
Hooloovoo Avatar asked Jul 17 '11 09:07

Hooloovoo


People also ask

How do you check for multiple Startswith in Python?

To check if a given string starts with any of multiple prefixes , you can use the any(iterable) function that returns True if at least one of the values in the iterable evaluates to True . You can check each prefix against the string by using a generator expression like so: any(s. startswith(x) for x in prefixes) .

How do you find all occurrences of a substring in Python?

finditer() The finditer function of the regex library can help us perform the task of finding the occurrences of the substring in the target string and the start function can return the resultant index of each of them.

Is there a Startswith function in Python?

startswith() method returns a boolean. It returns True if the string starts with the specified prefix. It returns False if the string doesn't start with the specified prefix.

Is Startswith a string method in Python?

Definition and Usage. The startswith() method returns True if the string starts with the specified value, otherwise False.


1 Answers

Generator expressions are evaluated lazily, so if you only need to determine whether or not your word is valid, I would expect the following to be more efficient since it doesn't necessarily force it to build the full list once it finds a match:

def word_exists(wordlist, word_fragment):
    return any(w.startswith(word_fragment) for w in wordlist)

Note that the lack of square brackets is important for this to work.

However this is obviously still linear in the worst case. You're correct that binary search would be more efficient; you can use the built-in bisect module for that. It might look something like this:

from bisect import bisect_left
def word_exists(wordlist, word_fragment):
    try:
        return wordlist[bisect_left(wordlist, word_fragment)].startswith(word_fragment)
    except IndexError:
        return False # word_fragment is greater than all entries in wordlist

bisect_left runs in O(log(n)) so is going to be considerably faster for a large wordlist.

Edit: I would guess that the example you gave loses out if your word_fragment is something really common (like 't'), in which case it probably spends most of its time assembling a large list of valid words, and the gain from only having to do a partial scan of the list is negligible. Hard to say for sure, but it's a little academic since binary search is better anyway.

like image 83
Peter Avatar answered Oct 10 '22 22:10

Peter