Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

sorting algorithm where pairwise-comparison can return more information than -1, 0, +1

Tags:

Most sort algorithms rely on a pairwise-comparison the determines whether A < B, A = B or A > B.

I'm looking for algorithms (and for bonus points, code in Python) that take advantage of a pairwise-comparison function that can distinguish a lot less from a little less or a lot more from a little more. So perhaps instead of returning {-1, 0, 1} the comparison function returns {-2, -1, 0, 1, 2} or {-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5} or even a real number on the interval (-1, 1).

For some applications (such as near sorting or approximate sorting) this would enable a reasonable sort to be determined with less comparisons.

like image 404
James Tauber Avatar asked May 27 '09 02:05

James Tauber


1 Answers

The extra information can indeed be used to minimize the total number of comparisons. Calls to the super_comparison function can be used to make deductions equivalent to a great number of calls to a regular comparsion function. For example, a much-less-than b and c little-less-than b implies a < c < b.

The deductions cans be organized into bins or partitions which can each be sorted separately. Effectively, this is equivalent to QuickSort with n-way partition. Here's an implementation in Python:

from collections import defaultdict from random import choice  def quicksort(seq, compare):     'Stable in-place sort using a 3-or-more-way comparison function'     # Make an n-way partition on a random pivot value     segments = defaultdict(list)     pivot = choice(seq)     for x in seq:         ranking = 0 if x is pivot else compare(x, pivot)         segments[ranking].append(x)     seq.clear()      # Recursively sort each segment and store it in the sequence     for ranking, segment in sorted(segments.items()):         if ranking and len(segment) > 1:             quicksort(segment, compare)         seq += segment  if __name__ == '__main__':     from random import randrange     from math import log10      def super_compare(a, b):         'Compare with extra logarithmic near/far information'         c = -1 if a < b else 1 if a > b else 0         return c * (int(log10(max(abs(a - b), 1.0))) + 1)      n = 10000     data = [randrange(4*n) for i in range(n)]     goal = sorted(data)     quicksort(data, super_compare)     print(data == goal) 

By instrumenting this code with the trace module, it is possible to measure the performance gain. In the above code, a regular three-way compare uses 133,000 comparisons while a super comparison function reduces the number of calls to 85,000.

The code also makes it easy to experiment with a variety comparison functions. This will show that naïve n-way comparison functions do very little to help the sort. For example, if the comparison function returns +/-2 for differences greater than four and +/-1 for differences four or less, there is only a modest 5% reduction in the number of comparisons. The root cause is that the course grained partitions used in the beginning only have a handful of "near matches" and everything else falls in "far matches".

An improvement to the super comparison is to covers logarithmic ranges (i.e. +/-1 if within ten, +/-2 if within a hundred, +/- if within a thousand.

An ideal comparison function would be adaptive. For any given sequence size, the comparison function should strive to subdivide the sequence into partitions of roughly equal size. Information theory tells us that this will maximize the number of bits of information per comparison.

The adaptive approach makes good intuitive sense as well. People should first be partitioned into love vs like before making more refined distinctions such as love-a-lot vs love-a-little. Further partitioning passes should each make finer and finer distinctions.

like image 69
Raymond Hettinger Avatar answered Sep 18 '22 13:09

Raymond Hettinger