Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Binary String Search - minimum bin width?

I happen to be building the binary search in Python, but the question has more to do with binary search structure in general.

Let's assume I have about one thousand eligible candidates I am searching through using binary search, doing the classic approach of bisecting the sorted dataset and repeating this process in order to narrow down the eligible set to iterate over. The candidates are just strings of names,(first-last format, eg "Peter Jackson") I initially sort the set alphabetically and then proceed with bisection using something like this:

hi = len(names)
lo = 0
while lo < hi:
  mid = (lo+hi)//2
  midval = names[mid].lower()
  if midval < query.lower():
    lo = mid+1
  elif midval > query.lower():
    hi=mid
  else:
    return midval
return None

This code adapted from here: https://stackoverflow.com/a/212413/215608

Here's the thing, the above procedure assumes a single exact match or no result at all. What if the query was merely for a "Peter", but there are several peters with differing last names? In order to return all the Peters, one would have to ensure that the bisected "bins" never got so small as to except eligible results. The bisection process would have to cease and cede to something like a regex/regular old string match in order to return all the Peters.

I'm not so much asking how to accomplish this as what this type of search is called... what is a binary search with a delimited criteria for "bin size" called? Something that conditionally bisects the dataset, and once the criteria is fulfilled, falls back to some other form of string matching in order to ensure that there can effectively be a ending wildcard on the query (so a search for a "Peter" will get "Peter Jacksons" and "Peter Edwards")

Hopefully I've been clear what I mean. I realize in the typical DB scenario the names might be separated, this is just intended as a proof of concept.

like image 840
DeaconDesperado Avatar asked Nov 04 '22 08:11

DeaconDesperado


1 Answers

I've not come across this type of two-stage search before, so don't know whether it has a well-known name. I can, however, propose a method for how it can be carried out.

Let say you've run the first stage and have found no match.

You can perform the second stage with a pair of binary searches and a special comparator. The binary searches would use the same principle as bisect_left and bisect_right. You won't be able to use those functions directly since you'll need a special comparator, but you can use them as the basis for your implementation.

Now to the comparator. When comparing the list element x against the search key k, the comparator would only use x[:len(k)] and ignore the rest of x. Thus when searching for "Peter", all Peters in the list would compare equal to the key. Consequently, bisect_left() to bisect_right() would give you the range containing all Peters in the list.

All of this can be done using O(log n) comparisons.

like image 133
NPE Avatar answered Nov 08 '22 08:11

NPE