Search/Sort Algorithms that sacrifice accuracy for speed

I really like to study algorithms and optimize code (I try not to do it prematurely) because it feels really cool when something that took 5 minutes to run now runs in 2 minutes. I have a special interest in search algorithms, since it's so common to need to find a matching substring or a set of entries in a table.

I was thinking about the lower bound for comparison sorts, and about how, for gigantic data sets, if the sort could skip some comparisons by guessing what the answer will be, then entire levels of compares could disappear and the height of the decision tree would drop. (E.g., when sorting a,b,c,d,e,f, if an algorithm could guess that b,c,d already belong together, then you're really only sorting a,bcd,e,f.) The guess would have to be smart and cheap to compute to make it worth it, plus it would need a fairly good batting average.
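
The closest concrete thing I can sketch for the "bcd already belong together" idea is run detection, as in natural merge sort (Timsort uses the same trick): one pass finds maximal already-sorted runs, and elements inside a run are never compared against each other again. This is just a toy illustration in that spirit, not a proposal:

```python
# Toy natural merge sort: detect maximal already-sorted runs, then
# merge them pairwise. A block that is already in order (like "bcd")
# costs only the single pass that discovers it.
def find_runs(a):
    """Split a into maximal non-decreasing runs."""
    runs, start = [], 0
    for i in range(1, len(a)):
        if a[i] < a[i - 1]:
            runs.append(a[start:i])
            start = i
    runs.append(a[start:])
    return runs

def merge(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:]); out.extend(right[j:])
    return out

def natural_merge_sort(a):
    if not a:
        return []
    runs = find_runs(a)
    while len(runs) > 1:
        # Merge adjacent runs pairwise until one run remains.
        runs = [merge(runs[k], runs[k + 1]) if k + 1 < len(runs) else runs[k]
                for k in range(0, len(runs), 2)]
    return runs[0]
```

(Note this one is still exact, though -- it exploits existing order rather than guessing.)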

Same with searching: a smart search could first guess where the item is likely to be and search only the top 5 guessed areas. If all 5 guesses are wrong it would never find the item and return a miss, but if it's substantially faster with a good enough hit ratio, it might be worth it. It could potentially beat building a binary search tree and doing O(log n) lookups.
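
To make that concrete, here is a toy sketch of a "guess first, give up early" lookup on a sorted numeric array. The interpolation-style guess heuristic and the guess/window counts are numbers I made up, nothing standard; the point is only that it does a bounded amount of work and can therefore miss items that are actually present:

```python
# Toy "lossy" search: probe a few guessed windows of a sorted numeric
# list, then give up. Can return None for a key that IS present --
# that's the accuracy-for-speed trade described above.
def bounded_guess_search(a, key, guesses=5, window=8):
    if not a or key < a[0] or key > a[-1]:
        return None
    lo, hi = 0, len(a) - 1
    for _ in range(guesses):
        if a[lo] == a[hi]:
            return lo if a[lo] == key else None
        # Guess a likely position by linear interpolation.
        pos = lo + int((key - a[lo]) * (hi - lo) / (a[hi] - a[lo]))
        pos = max(lo, min(hi, pos))
        # Scan a small window around the guess.
        for i in range(max(lo, pos - window), min(hi, pos + window) + 1):
            if a[i] == key:
                return i
        # Narrow toward the side the key must be on and guess again.
        if a[pos] < key:
            lo = min(hi, pos + window + 1)
        else:
            hi = max(lo, pos - window - 1)
    return None  # gave up -- possibly a false negative
```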

Anyway, I'm sure anyone who understands the subject realized a while ago that this is mostly speculation with no real substance, so I'm asking for help taking steps toward learning about algorithms that don't return 100% correct results but are faster in exchange, particularly in the search/sort areas, and about where such algorithms get applied.

I've Googled and clicked random links on Wikipedia trying to find this, with no satisfactory results. What should I read, and where should I go, to start learning about this?

I guess I should mention that I'm comfortable with most of the "standard" algorithms and data structures: quicksort, merge sort, bubble sort, radix sort, counting sort, etc., plus hashes, self-balancing trees, and so on.

asked Jan 20 '23 by NorthGuard


1 Answer

I think to accomplish much, you'd have to define some criteria for your "almost sorted". If, for example, having an element within N spots of the correct place was sufficient, you could do something like Quicksort, but stop when a partition was down to N elements. Note that it's already common to do this, and finish the job with an insertion sort. Unless N was pretty large, however, you probably wouldn't gain much from this.
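
As a hedged sketch of that standard trick (the cutoff value below is illustrative, not tuned): quicksort leaves any partition of at most CUTOFF elements alone, which guarantees every element ends up within CUTOFF of its final position, and a single insertion-sort pass then finishes cheaply because nothing has far to move.

```python
# Sketch: quicksort that stops at small partitions, leaving the array
# "almost sorted" (every element within CUTOFF of its final position),
# followed by one insertion-sort pass to finish the job.
CUTOFF = 16  # illustrative, not a tuned constant

def quicksort_coarse(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if hi - lo + 1 <= CUTOFF:
        return  # leave small partitions for the insertion sort
    pivot = a[(lo + hi) // 2]
    i, j = lo, hi
    while i <= j:  # Hoare-style partition around pivot
        while a[i] < pivot:
            i += 1
        while a[j] > pivot:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i, j = i + 1, j - 1
    quicksort_coarse(a, lo, j)
    quicksort_coarse(a, i, hi)

def insertion_sort(a):
    for i in range(1, len(a)):
        x, j = a[i], i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x

def cutoff_sort(a):
    quicksort_coarse(a)
    insertion_sort(a)  # cheap: every element is already close to home
```

If you wanted the "almost sorted" variant the question asks about, you'd simply skip the insertion_sort call and accept elements being up to CUTOFF spots off.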

As far as searching goes, you're probably looking for what's usually called an interpolation search. Instead of always guessing at the middle of a range, you use interpolation to guess at a likely spot for the item you're looking for (e.g., if you're looking for a string that starts with 'b', you start about 1/13th of the way through the collection instead of halfway through).

If the items in the collection are extremely unevenly distributed, this may not work out particularly well, but assuming even a reasonably even distribution, it tends to give extremely good results (around O(log log N) instead of the O(log N) you get with a binary search). It does, however, depend on that even distribution, and on having a key type for which you can compute something at least reasonably similar to a "distance", not just a "less than" or "greater than" comparison. In practice, it often works quite well though (and cases where it won't are usually pretty obvious up front).
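
A minimal sketch of the idea on a sorted numeric list (one common formulation; for string keys you'd first need some mapping to numbers to get that "distance"):

```python
# Interpolation search: estimate where key *should* sit by linear
# interpolation instead of always probing the middle. Roughly
# O(log log N) probes on evenly distributed keys; degrades badly
# when the distribution is heavily skewed.
def interpolation_search(a, key):
    lo, hi = 0, len(a) - 1
    while lo <= hi and a[lo] <= key <= a[hi]:
        if a[lo] == a[hi]:
            return lo if a[lo] == key else None
        # The "distance" computation: position key proportionally.
        pos = lo + int((key - a[lo]) * (hi - lo) / (a[hi] - a[lo]))
        if a[pos] == key:
            return pos
        if a[pos] < key:
            lo = pos + 1
        else:
            hi = pos - 1
    return None
```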

answered Jan 29 '23 by Jerry Coffin