
What sorting techniques can I use when comparing elements is expensive?

Problem

I have an application where I want to sort an array a of elements a0, a1, ..., an-1. I have a comparison function cmp(i,j) that compares elements ai and aj, and a swap function swap(i,j) that swaps elements ai and aj of the array. In the application, a single execution of cmp(i,j) can be extremely expensive, to the point where it takes longer than all other steps in the sort combined (except for other cmp(i,j) calls, of course). You may think of cmp(i,j) as a rather lengthy IO operation.

Please assume for the sake of this question that there is no way to make cmp(i,j) faster. Assume all optimizations that could possibly make cmp(i,j) faster have already been done.

Questions

  • Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?

  • It is possible in my application to write a predicate expensive(i,j) that is true iff a call to cmp(i,j) would take a long time. expensive(i,j) is cheap to evaluate, and expensive(i,j) ∧ expensive(j,k) → expensive(i,k) mostly holds in my current application, though this is not guaranteed.

    Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparisons? If yes, can you point me to such an algorithm?

  • I'd like pointers to further material on this topic.

Example

This is an example that is not entirely unlike the application I have.

Consider a set of possibly large files. In this application the goal is to find duplicate files among them. This essentially boils down to sorting the files by some arbitrary criterion and then traversing them in order, outputting sequences of equal files that were encountered.

Of course, reading in large amounts of data is expensive, therefore one can, for instance, read only the first megabyte of each file and calculate a hash function on this data. If the files compare equal, so do the hashes, but the reverse may not hold: two large files could differ in only one byte near the end.

The implementation of expensive(i,j) in this case is simply a check whether the hashes are equal. If they are, an expensive deep comparison is necessary.
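
For concreteness, here is a minimal sketch of this scheme in Python (the names prefix_hash and find_duplicates are illustrative; the one-megabyte prefix and the deep fallback comparison follow the description above, except that files are grouped by hash rather than sorted, which is equivalent for duplicate detection):

    import hashlib
    from collections import defaultdict
    from filecmp import cmp as files_equal

    PREFIX_SIZE = 1 << 20  # hash only the first megabyte, as described above

    def prefix_hash(path):
        """Cheap fingerprint: hash of the first megabyte of the file."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            h.update(f.read(PREFIX_SIZE))
        return h.digest()

    def find_duplicates(paths):
        """Group files by cheap hash, then deep-compare only within groups."""
        groups = defaultdict(list)
        for path in paths:
            groups[prefix_hash(path)].append(path)

        duplicates = []
        for candidates in groups.values():
            # Only files whose prefix hashes collide need the expensive
            # byte-by-byte comparison (filecmp with shallow=False).
            while len(candidates) > 1:
                first, rest = candidates[0], candidates[1:]
                equal = [p for p in rest if files_equal(first, p, shallow=False)]
                if equal:
                    duplicates.append([first] + equal)
                candidates = [p for p in rest if p not in equal]
        return duplicates

The prefix hash plays the role of the cheap ordering criterion; the full comparison is only paid for when two fingerprints collide.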

asked Aug 22 '13 by fuz

1 Answer

I'll try to answer each question as best I can.

  • Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?

Traditional sorting methods vary somewhat, but for comparison-based sorting there is a hard mathematical limit: at least ceil(log2(n!)) ≈ n log2 n comparisons are needed in the worst case, and most good algorithms already operate close to that limit, precisely because comparisons are rarely cheap. (Merge-insertion sort, also known as Ford-Johnson, gets closest to the bound for small inputs.) Beyond that, you could try sorting by something else entirely, or use a cheaper shortcut criterion that approximates the real order.
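
As a quick illustration of how close a good library sort already gets to that limit, here is a small self-contained Python sketch (not part of the original answer) that counts the comparisons the built-in Timsort makes against the ceil(log2(n!)) lower bound:

    import functools
    import math
    import random

    def count_comparisons(n=1000):
        """Count comparisons made by Python's built-in Timsort on random
        data and compare against the ceil(log2(n!)) worst-case lower
        bound for any comparison-based sort."""
        calls = 0

        def cmp(x, y):
            nonlocal calls
            calls += 1
            return (x > y) - (x < y)  # classic three-way comparison

        data = random.sample(range(10 * n), n)
        sorted(data, key=functools.cmp_to_key(cmp))

        bound = math.ceil(math.log2(math.factorial(n)))
        print(f"comparisons made: {calls}, information-theoretic minimum: {bound}")

    count_comparisons()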

  • Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparisons? If yes, can you point me to such an algorithm?

I don't think you can get around the necessity of doing at least the minimum number of comparisons, but you may be able to change what you compare. If you can compare hashes or subsets of the data instead of the whole thing, that can certainly help. Anything you can do to simplify the comparison operation itself will make a big difference, but without knowing specific details of the data, it's hard to suggest specific solutions.
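
One way to wire this up is a tiered sort key: a cheap fingerprint decides most orderings, and the expensive comparison runs only when fingerprints tie, which is exactly the expensive(i,j) case from the question. A minimal Python sketch (tiered_sort, cheap_key, and deep_cmp are illustrative names, not from the original answer):

    import functools

    def tiered_sort(a, cheap_key, deep_cmp):
        """Sort so that the expensive comparison runs only on cheap-key ties.

        cheap_key(x)   -- inexpensive fingerprint, e.g. a hash of a prefix
        deep_cmp(x, y) -- the expensive full comparison, returning -1/0/1
        """
        deep_key = functools.cmp_to_key(deep_cmp)
        # Tuples compare element-wise, so deep_cmp is consulted only when
        # the cheap keys of two elements are equal.
        return sorted(a, key=lambda x: (cheap_key(x), deep_key(x)))

Note that this sorts primarily by the fingerprint, which changes the overall order; that is fine when the sort criterion is arbitrary anyway, as in the duplicate-files example, and the number of deep comparisons then drops to roughly the number of colliding pairs.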

  • I'd like pointers to further material on this topic.

Check these out:

  • Apparently Donald Knuth's The Art of Computer Programming, Volume 3 has a section on this topic, but I don't have a copy handy.
  • Wikipedia of course has some insight into the matter.
  • Sorting an array with minimal number of comparisons
  • How do I figure out the minimum number of swaps to sort a list in-place?
  • Limitations of comparison based sorting techniques
answered Oct 29 '22 by pattivacek