Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is a sorting algorithm that is robust to a faulty comparison?

I want to sort a list of n items with a comparison sort. However, one of the comparisons made by the algorithm will be flipped from what it's supposed to be. Specifically, there is one pair of items for which the comparator function consistently gives the wrong result.

What is a efficient n*log(n) sorting algorithm that will be robust to this faulty comparison? By robust, I mean that every item is off by at most k spots from its true position, for some reasonably small k.

If possible, I'd like it to be robust in the worst case (faulty comparison chosen adversarially), but I'll settle for robust in the average case.

An example robust algorithm (that's not efficient), would be to make all n*(n-1)/2 pairwise comparisons, and place each item by how many of the comparisons they won. Then, no matter what comparison the adversary makes, each items index will be off by no more than k=1.

An example of a NON-robust algorithm is quicksort, because the adversary could just choose the largest item to be on the wrong side of the first pivot, making it on average n/2 spots off from its correct index.

like image 813
chausies Avatar asked Jul 06 '21 14:07

chausies


1 Answers

TL;DR: It's possible to modify quicksort to get the following guarantee: in (expected) time O(n log n), we can do one of the following, depending on which comparison is flipped.

  • Perfectly sort the array.
  • Perfectly sort the array, except that an adjacent pair of items somewhere in the array is swapped.
  • Perfectly sort the array, except that three consecutive items in the array, which can be identified, are permuted.

This guarantees a maximum displacement of 2, which is as good as is theoretically possible.


I mulled over this problem for a couple of hours and everything I'm doing connects back to tournaments.

I'd like to begin by trying to reframe the question as follows. If you have a set of n items and you know the "true" results of the comparisons between them, you can represent that result as a directed graph with one node per item and edges indicating when one item compares less than another. This type of digraph is called a "tournament," since you can think of it as encoding the result of a round-robin tournament where each player plays each other player.

In the case of an honest comparator, our tournament will be acyclic, and in particular it will have the following key property: there's exactly one node of each outdegree 0, 1, 2, ..., n - 1. The idea here is that the smallest element will have outdegree n - 1 (it's smaller than everything else), while the largest element will have outdegree 0 (it's bigger than everything else). And in fact, there's a theorem that a tournament is acyclic if and only if each node in the tournament has a different outdegree. Another useful fact: in an acyclic tournament, there's an edge from U to V if and only if outdeg(U) > outdeg(V).

In the case of a "dishonest comparator," we essentially start with an acyclic tournament, then flip a single edge. Your question asked about doing approximate sorting based on this comparator, but I'd like to step back and ask a different question, which I think can then be used to answer yours more precisely. In what cases can you figure out which edge was flipped? If we can do that, then we can do even better than approximate sorting - we can "unflip" the edge and sort perfectly. On the other hand, in which cases can you not figure out which edge was flipped, and when that happens, how far from sorted will we end up? That corresponds to having to do an approximate sort because we can't recover the original ordering.

Here's a useful fact:

Theorem: Begin with an acyclic tournament and flip a single edge. Then it's possible to determine which edge was flipped if and only if the outdegrees of the two endpoints of the flipped edge originally differ by at least three.

To prove this, we'll show both directions of implication.

First, suppose that we flip an edge between two nodes X and Y whose outdegrees differ by one. When we're done, we're left with a tournament where all nodes have different outdegrees (all other nodes have their outdegrees unchanged, and if we flipped the edge (X, Y), then X and Y swap outdegrees because one goes up by one and one goes down by one). We're now left with another acyclic tournament. And in particular, we can't tell which edge we flipped, because we could have just as well flipped any edge between any pair of nodes whose outdegrees differ by one.

Next, suppose we flip an edge between nodes X and Y where the outdeg(X) = k+1 and outdeg(Y) = k-1. We now have outdeg(X) = k = outdeg(Y), and somewhere else to begin with there must have been some node Z with outdegree k as well. So at this point, we have three nodes of outdegree k (namely, X, Y, and Z), and we know that we must have flipped one of the three edges between them. But we can't tell which one it was. Specifically, flipping the XY edge, or the XZ edge, or the YZ edge would all give back acyclic tournaments. So in that case, there's no way to undo the transform. That means that any sorted ordering we get from this comparator will have those two items out of place, so we'd have a maximum distance of at least 1.

An important note for this particular case: this corresponds to the comparator creating a tournament with exactly one cycle containing the nodes X, Y, and Z. Specifically, it'll take on the form X, Z, Y, X. The problem is we can't tell whether the original ordering was (X, Z, Y), or (Z, Y, X), or (Y, X, Z), and so we'd have a maximum distance of at least 2.

And finally, suppose that we have two nodes X and Y and flip the edge XY in the case where outdeg(X) = k, outdeg(Y) = m, and k ≥ m + 3. We're now left with a tournament in which two nodes have outdegree k - 1 and two nodes have outdegree m + 1. But of those four nodes, it's guaranteed that there's exactly one pair of them that can be flipped back to produce an acyclic tournament. One way to see this: take the four nodes that now have repeated outdegrees; call them X and Y (as above) and also W and Z, and suppose we have the cycle X, W, Z, Y, X, where the only flipped edge from the original is (Y, X). What will this cycle look like? Well, since (X, W), (W, Z), and (Z, Y) are edges in the tournament that weren't flipped, back in the original tournament we have outdeg(X) > outdeg(W) > outdeg(Z) > outdeg(Y). That means that we have to have X and W having outdegree k - 1 in the new graph and Z and Y having outdegree m + 1 in the new graph. Therefore, only flipping the edge from Y to X will increase the degree of one of the degree-(k-1) nodes back up to k while also decreasing the degree of one of the degree-(m+1) nodes down to m.

Summarizing:

Theorem: The faulty comparator will either

  1. Behave as a real comparator, in which case we swapped two adjacent elements in the original sequence and we will never know which.
  2. Have exactly one cycle of length three of elements whose original ordering can never be known, or
  3. Have a cycle of length four, in which case we can identify which comparison is reversed.

With this in mind, it seems reasonable to reframe your problem in the following way:

Goal: Design an algorithm that, in time O(n log n), does one of the following things to a list of n elements given a faulty comparator that returns the wrong result when comparing two fixed elements X and Y against one another:

  1. Perfectly sort the list.
  2. Perfectly sort the list, except with two adjacent items swapped.
  3. Perfectly sort the list, except with three adjacent items permuted.

Here's one possible algorithm that does this in expected O(n log n) time that's based on quicksort. The basic idea is the following: we run more or less a regular quicksort, at each point in time checking to see whether we found a triangle. If not, then either we're in case (1) or case (2). If we do find a triangle, we see whether we can identify which comparison got reversed. If we can, then we rerun quicksort, except that we "fix" the comparator in this broken case. If we can't, then we're in case (3) and just finish quicksort as usual.

The specific technique we'll use to detect a triangle works like this. Begin with a regular, vanilla quicksort: pick a pivot, partition the array into things less than the pivot and things bigger than the pivot, then recursively sort the two smaller subarrays. However, after doing so, we do one additional step: assuming the subarray we're sorting has three or more elements in it, look at the pivot p and the element just before and just after it (call those s, p, g for "smaller," "pivot," and "greater"). Then if the comparator says s < p < g < s, we've found a triangle. And in fact, we have something stronger.

Suppose that at some point in quicksort comparator does indeed compare X and Y, the mismatched items. We're assuming X < Y, but that the comparator incorrectly reports that Y < X. The only way that two items can be compared in quicksort is if one of them is a pivot element at a time when the other is in the current subarray. Without loss of generality, let's assume that X was the pivot, and that Y was compared against it.

What should happen here, assuming the comparator was honest, is that Y would be found to be larger than X, and therefore would be placed into the "bigger" subarray. But because the comparator is a lying liar who lies, instead Y gets placed into the "smaller" subarray. If we then recursively sort the "smaller" subarray and the "bigger" subarray, think about where Y will end up. It's in the "smaller" subarray but is actually bigger than X, which means it'll compare larger than everything in that "smaller" subarray. Consequently, Y will appear just before X. Now, look at the items in the "bigger" subarray. There are two options. The first is that in the "real" ordering, there's at least one value between X and Y. That value would then appear in the "bigger" subarray because it's larger than X, and in particular the first element of the "bigger" subarray would compare smaller than Y. That would mean that Y, then X, then the item immediately after X after sorting would form a triangle. The other option is that X and Y are adjacent in the true sorted ordering, which case we'd never find out (as mentioned above). This, combined with the above insight, means that

Theorem: Suppose we run quicksort, and after recursively sorting the left and right subarrays we look at the three items consisting of the pivot, the item just before it, and the item just after it to see if they form a triangle. Then if this algorithm detects a triangle, a triangle exists. Moreover, if this algorithm does not detect a triangle, then either (1) no triangle exists or (2) a triangle does exist, but the comparator was never applied to the bad pair (X, Y) and so the sorted order is correct.

With all this said and done, we can state the full algorithm that, in expected O(n log n) time, sorts the array as best as is possible.

function modifiedQuicksort(array, comparator):
    if array has length 0 or 1, return.
    
    pick a random pivot element from the array.

    use the comparator to form subarrays smaller and greater based on
       how elements compare against the pivot.

    recursively apply modifiedQuicksort to those two arrays.

    if the comparator finds a triangle formed from the last element of
       smaller, the pivot, and the first element of greater, report those
       three items as a triangle.

    return smaller, pivot, greater.

function sortAsBestWeCan(array, comparator):
    run modifiedQuicksort(array, comparator)

    if it didn't report a triangle, return the result of the call.

    otherwise, it reported a triangle A, B, C.

    for each other item D:
        if comparator(A, D) and comparator(D, B)   or
           comparator(B, D) and comparator(D, C)   or
           comparator(C, D) and comparator(D, A):

            you have found a 4-cycle from A, B, C, and D.

            detect which comparison is reversed.

            use that knowledge plus the comparator and your favorite
                O(n log n)-time sorting algorithm to perfectly sort
                the input array.

    otherwise, those three items are the only triangle, and the
       array is sorted as well as it can be. return it.
like image 117
templatetypedef Avatar answered Sep 23 '22 14:09

templatetypedef