Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to sort an array only using these blackbox functions?

The problem goes as follows:

As input you get an array (ARR) of size N that you do not know the contents of – and will not be able to see the contents of – and a pre-assigned value U≪N. Your solution should work for any value of U and N.

You are given a work array (W) of size N. You can also use the following helper functions which both run in O(1) time:

  • SORT(A, B): SORT will take two contiguous sections of the array ARR or W, of size |A|≤U and |B|≤U, merge them into a new array, sort them, then put the lower part of size |A| of this new array back into A, and the remaining values into B. It will return the number of values from A that remained in A.

  • COPY(A -> C): COPY will make a copy of the contents of A into C. |A| == |C|.

I already have a solution to this problem, which is a variant on merge sort and has a similar runtime of O( (N/U) log (N/U) ), but I was wondering if there was a faster solution.

Also, I was wondering how you could sort the array under the constraint that you no longer have the work array (W).


As requested, my solution (very informal and simplified)

# Define helper subroutine MERGE (merges two sorted arrays)
MERGE (array A[1...m], array B[1...n]):
  a_idx = 1
  b_idx = 1
  w_idx = 1
  while a_idx < m+1 and b_idx < n+1; do
    COPY( A[a_idx ... a_idx+U], W[w_idx ... w_idx+U])
    COPY( B[b_idx ... b_idx+U], W[w_idx+U+1 ... w_idx+2U])
    F=W[w_idx ... w_idx+U]
    G=W[w_idx+U+1 ... w_idx+2U]
    offset = SORT(F,G)
    a_idx += offset
    b_idx += (U-offset)
    w_idx += U
  end while

  if a_idx != m+1:
    # this can be done in a while loop as well but just writing it down like this for simplification
    COPY( A[a_idx ... m+1], W[w_idx ... w_idx+(m+1-a_idx)])
    w_idx += (m+1-a_idx)
  end if

  if b_idx != n+1:
    # this can be done in a while loop as well but just writing it down like this for simplification
    COPY( B[b_idx ... n+1], W[w_idx ... w_idx+(n+1-b_idx)])
    w_idx += (n+1-b_idx)
  end if
  
  COPY(W[1 ... m], A[1 ... m])
  COPY(W[m+1 ... m+n], B[1 ... n])
end MERGE

Once you have this subroutine, you can easily sort the original array into sorted pieces of size 2U using SORT, and then use merge sort logic with this routine to merge it into one whole sorted array.

like image 620
PoorProgrammer Avatar asked Jul 17 '19 14:07

PoorProgrammer


1 Answers

Your algorithm looks like a good idea, although I could not make it work yet with a practical implementation.

Time Complexity

One cannot hope to do better than O(N/U log(N/U)), because the SORT function is the only method available to get information about comparisons and perform permutations, and it can act on at most 2U data elements. It is like a SWAP function: as a SWAP orders 2 values, SORT orders 2U values.

Imagine a special-case input:

  • Its size is a multiple of U.
  • Each individual chunk (subarray) of size U, starting at an index that is a multiple of U (zero-based), is already sorted
  • The min-max range of one chunk does not overlap with that of another. So for instance, you would not have [1,3,5,2,4,6] with U=3, because then 1-5 overlaps 2-6. On the other hand [7,9,11,1,5,6] with U=3 would fit this condition

So with such input, the problem of sorting is reduced to sorting N/U chunks. Those chunks remain invariant. SORT on a pair of chunks will act like a SWAP. And that is like a normal sort on M=N/U elements, which is O(MlogM). I don't see how applying SORT to a cross-chunk range would be helpful to somehow speed up the sorting, as it would destroy the order that was already achieved in the involved chunks. So in my opinion the best you can get is O(N/U log(N/U)).

Solution without using W or COPY

One could use the ideas of Heap Sort to sort the input with just the SORT function.

The choice for Heap Sort follows from the observation that the heapify and siftDown functions (typical to heap sort) can be implemented with swaps only, without the need to temporarily keep one of the values aside.

I would consider working with "chunks" on fixed index boundaries, meaning each chunk occupies a range of [iU, (i+1)U] (indexes assumed to be zero-based, and with the value at the end-index not included in the chunk). The last chunk could be smaller because the input size may not always be a multiple of U. Each chunk would be regarded a node in the heap.

Some adjustments are still needed to Heap Sort, to make it work:

  • The heap property in a Max Heap says that a parent value should not be less than any of its child values. As we work with chunks, we could aim to ensure that all values in the parent node are not less than all values in each child node. In other words, the minimum value of the parent chunk should not be less than the maximum value of each child chunk.

  • In the siftDown procedure, Heap Sort would compare the values of the two children of the node to sift down, and choose the child with the greatest value. In our case, we have no pure comparison function, but we could use SORT to sort the two children so that their range of values do not overlap anymore. However, this could break the heap property with the grandchildren. To avoid that, we must make sure that the child that has the minimum value among the two children, keeps that minimum value after the sort. In other words, we should put that child as first argument in the call to SORT. This clearly guarantees that the heap property holds with that child's children. But also the sibling cannot possibly get a lesser minimum value than it had before SORT was called. So that child also maintains the heap property with its own children.

  • The previous point still leaves unresolved how the child can be identified that has the least value among the two children (needed to decide how to order the arguments to SORT). For that we will first make sure that each child individually has already been sorted, so that each child has its minimum value in the first slot of the chunk. Then we can call SORT with just that leftmost value (array of length 1) of the left child and of the right child. If this results in a swap the return value of SORT will be 0. In that case we call SORT again with inverted arguments in order to undo the swap. But the first return value will tell us which is the smallest value.

  • In the heapify function we will make sure that all the leaf nodes get sorted. The rest of the standard heapify method will automatically make each of the internal nodes sorted as well.

  • The actual siftDown step would compare the parent node's value with that of the greatest child and perform a swap if the child value turns out to be greater than the parent's value. In our case, this can all be done with one call to SORT (after the child was selected via the above method): pass that child chunk as first argument, and the parent chunk as second. This will guarantee the heap property between the parent and child, but also with the other child (if any), because we already established that all values in that other child were not greater than those in the selected child.

  • There is a boundary case where the last leaf will have a smaller chunk size. This can become tricky to handle when it also has a sibling, and a siftDown operation could possibly select the rightmost, shorter chunk. To avoid that this leaf would ever be a candidate for a siftDown selection, we will make sure that its sibling will be the "greater" child. This can always be done by sorting the two leaves with SORT (putting the shorter chunk as first argument), because we don't have to worry about maintaining the heap property with grandchildren (there are none). When the shorter leaf is the only child of its parent, there is no problem.

Implementation

Instead of pseudo code, I provide here an implementation in JavaScript, which I think is quite readable.

The snippet below allows to enter array values and a value for U. The result is updated in real time.

// Implementation of the SORT function (not part of the solution)
function createSortFunction(maxChunkSize) {
    // Given the value U, a specific function is created that will throw 
    // when U is exceeded
    return function sort(arr, aStart, aEnd, bStart, bEnd) {
        if (aEnd - aStart > maxChunkSize || bStart - bEnd > maxChunkSize) {
            throw new Error("SORT called with a too large range");
        }
        const aLength = aEnd - aStart;
        let sorted = [...arr.slice(aStart, aEnd), ...arr.slice(bStart, bEnd)]
                .map((x, i) => [x, i]) // temporarily store the unsorted index with each value
                .sort(([x],[y]) => x-y); //... then sort numerically
        // Count the number of values in the first chunk that originally came from A (had a low index)   
        let count = sorted.slice(0, aEnd-aStart).reduce((count, [,i]) => count + (i < aLength), 0);
        sorted = sorted.map(([x]) => x); // remove the temporary index info
        // Populate the original arrays with the sorted values
        arr.splice(aStart, aLength, ...sorted.splice(0, aLength));
        arr.splice(bStart, bEnd-bStart, ...sorted);
        return count;
    }
}

function blackboxSort(arr, maxChunkSize, sort) {
    // Sorting algorithm based on HeapSort, but only using
    //    the given sort function for any data access
    let numCalls = 0;
    let numChunks = Math.ceil(arr.length / maxChunkSize);

    // First some local functions:
    function chunkRange(i, size) {
        // Given a chunk-number, return the corresponding start-end indexes in the array
        // Last chunk may be smaller in size when array size is not multiple of chunk size
        i *= maxChunkSize;
        return [Math.min(i, arr.length), Math.min(i + size, arr.length)];
    }
    
    // Wrapper around SORT function:
    function chunkPairSort(i, j, size=maxChunkSize) {
        let [iStart, iEnd] = chunkRange(i, size);
        let [jStart, jEnd] = chunkRange(j, size);
        numCalls++; // Keep track of how many calls to SORT are made
        // Return true when the call to SORT exchanged at least one value between the chunks
        return sort(arr, iStart, iEnd, jStart, jEnd) < iEnd - iStart;
    }
    
    function isLessThan(i, j) {
        // Returns true when the first value in the first chunk is smaller 
        // than the first value in the second chunk. The second call will
        // undo the change if the first call made a swap:
        return !chunkPairSort(i, j, 1) || !chunkPairSort(j, i, 1);
    }
    
    function siftDown(parent, numChunks) {
        let child = parent*2+1;
        let dirty = true;
        while (dirty && child < numChunks) {
            if (child + 1 < numChunks) { // There are 2 children
                // We know that each child has its values already sorted.
                // Check which of the two children has the smallest value.
                // Then sort the two children in a way that keeps the smallest value 
                // in the child where it currently is. However, if the children are
                // leaves, always sort with the leftmost child getting the greatest values,
                // so we are sure the child with the greatest values has a full chunk size.
                if (child*2+1 < numChunks && isLessThan(child, child+1)) {
                    chunkPairSort(child, child+1);
                    child++; // pick the greatest child.
                } else {
                    chunkPairSort(child+1, child);
                }
            }
            // Perform the actual sift-down-swap
            dirty = chunkPairSort(child, parent);
            // Walk further down the heap
            parent = child;
            child = child*2+1;
        }
    }
    
    function heapify(numChunks) {
        const lastParent = (numChunks-2) >> 1;
        // We would not do this loop in standard heap sort.
        // However, we need each leaf in the future-heap to have the smallest value in first position,
        //   as the implementation of siftDown assumes this.
        for (let chunk = numChunks - 1; chunk > lastParent; chunk-=2) {
            chunkPairSort(chunk-1, chunk);
        }
        // Standard loop in Floyd's algorithm:
        for (let parent = lastParent; parent >= 0; parent--) {
            siftDown(parent, numChunks);
        }
    }
    
    // Main:
    
    if (arr.length <= 2*maxChunkSize) { // Trivial case:
        chunkPairSort(0, 1);
    } else {
        // Build heap: Floyd's method = O(n) where n = number of chunks
        heapify(numChunks);
        // Now all chunks have their values sorted, and are in a maxheap order
        // Extract chunks from the maxheap and place them just after the reduced heap = O(nlogn)
        for (let size = numChunks-1; size > 0; size--) {
            // Swap first chunk with last chunk
            // The first time this may not be a pure swap: when the last chunk is smaller in size.
            //   But even then the effect is fine: the largest values will end up
            //   in that smaller chunk, and the smaller in the (larger) root.
            chunkPairSort(0, size);
            // Root is now "dirty", so sift it down the (reduced) heap
            siftDown(0, size);
        }
    }
    return numCalls;
}

// Snippet I/O handling
function refresh () {
    let array = (document.querySelector("#array").value.match(/-?\d+/g) || []).map(Number);
    let maxChunkSize = Math.max(1, document.querySelector("#U").value) || 1;
    let numCalls = blackboxSort(array, maxChunkSize, createSortFunction(maxChunkSize));
    document.querySelector("#sorted").textContent = JSON.stringify(array);
    document.querySelector("#calls").textContent = numCalls;
}
(document.oninput = refresh)();
document.querySelector("#random").onclick = function () {
    document.querySelector("#array").value = Array.from({length: 50}, () => ~~(Math.random() * 1000)).join` `;
    refresh();
}
Input array: <button id="random">Random</button><br>
<input id="array" style="width:100%" value="5 3 8 4 0 2 7 9 11 1 10 8"><br>
U: <input id="U" type="number" value="3" style="width: 3em"><br>
Result: <span id="sorted"></span><br>
Calls made to sort: <span id="calls"></span>
like image 66
trincot Avatar answered Oct 24 '22 11:10

trincot