Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Perform a binary search for a string prefix in Python

I want to search a sorted list of strings for all of the elements that start with a given substring.

Here's an example that finds all of the exact matches:

import bisect
names = ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']
names.sort()
leftIndex = bisect.bisect_left(names, 'bob')
rightIndex = bisect.bisect_right(names, 'bob')
print(names[leftIndex:rightIndex])

Which prints ['bob', 'bob', 'bob'].

Instead, I want to search for all the names that start with 'bob'. The output I want is ['bob', 'bob', 'bob', 'bobby', 'bobert']. If I could modify the comparison method of the bisect search, then I could use name.startswith('bob') to do this.

As an example, in Java it would be easy. I would use:

Arrays.binarySearch(names, "bob", myCustomComparator);

where 'myCustomComparator' is a comparator that takes advantage of the startswith method (and some additional logic).

How do I do this in Python?

like image 891
dln385 Avatar asked Sep 11 '11 19:09

dln385


2 Answers

bisect can be fooled into using a custom comparison by using an instance that uses the custom comparator of your chosing:

>>> class PrefixCompares(object):
...     def __init__(self, value):
...         self.value = value
...     def __lt__(self, other):
...         return self.value < other[0:len(self.value)]
... 
>>> import bisect
>>> names = ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']
>>> names.sort()
>>> key = PrefixCompares('bob')
>>> leftIndex = bisect.bisect_left(names, key)
>>> rightIndex = bisect.bisect_right(names, key)
>>> print(names[leftIndex:rightIndex])
['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert']
>>> 

DOH. the right bisect worked, but the left one obviously didn't. "adam" is not prefixed with "bob"!. to fix it, you have to adapt the sequence, too.

>>> class HasPrefix(object):
...     def __init__(self, value):
...         self.value = value
...     def __lt__(self, other):
...         return self.value[0:len(other.value)] < other.value
... 
>>> class Prefix(object):
...     def __init__(self, value):
...         self.value = value
...     def __lt__(self, other):
...         return self.value < other.value[0:len(self.value)]
... 
>>> class AdaptPrefix(object):
...     def __init__(self, seq):
...         self.seq = seq
...     def __getitem__(self, key):
...         return HasPrefix(self.seq[key])
...     def __len__(self):
...         return len(self.seq)
... 
>>> import bisect
>>> names = ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']
>>> names.sort()
>>> needle = Prefix('bob')
>>> haystack = AdaptPrefix(names)
>>> leftIndex = bisect.bisect_left(haystack, needle)
>>> rightIndex = bisect.bisect_right(haystack, needle)
>>> print(names[leftIndex:rightIndex])
['bob', 'bob', 'bob', 'bobby', 'bobert']
>>> 
like image 105
SingleNegationElimination Avatar answered Sep 27 '22 19:09

SingleNegationElimination


Unfortunately bisect does not allow you to specify a key function. What you can do though is add '\xff\xff\xff\xff' to the string before using it to find the highest index, then take those elements.

like image 39
Ignacio Vazquez-Abrams Avatar answered Sep 27 '22 18:09

Ignacio Vazquez-Abrams