Efficiently merging and re-sorting sorted lists

This isn't the classic "merging two sorted lists" question, which is fairly trivial to do in linear time.

What I'm trying to do is merge two lists of (key, value) pairs, already sorted by value, where there are objects with the same key in both lists: such objects should have their values merged (added), which may change their sort order. I'm primarily interested in how the sort can be efficiently performed using information from the already sorted lists, since the sort is the slowest part of this algorithm.

Let's take a concrete example. Imagine a List of Student objects:

class Student {
  final String name;
  final int score;
  ...
}

Given as input two List<Student> sorted by score, I'd like to create a new merged list of students, where any student (identified by Student.name) appearing in both lists appears once in the final list, with a score equal to the sum of their scores in the two lists. The original lists should be left unmodified.

E.g.,

List 1:
{"bob", 20}
{"john", 15}
{"mark", 14}

List 2:
{"bill", 11}
{"mark", 9}
{"john", 1}

Result:
{"mark", 23}
{"bob", 20}
{"john", 16}
{"bill", 11}

The merging itself (identifying students that appear in both lists) can be done in expected O(1) time per element (O(n) overall) using any O(1) lookup/insert structure such as a HashMap. What I'm most interested in is the sort step (although I don't exclude solutions that do the merging and the sorting at the same time).
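
For concreteness, here's a minimal sketch of that baseline, assuming a Student(String name, int score) constructor (elided in the class above) and names that are unique within each input list:

import java.util.*;

class MergeBaseline {
  // Fold both lists into a HashMap keyed by name (expected O(1) per
  // lookup/insert), then rebuild the combined list and re-sort it with
  // an ordinary comparison sort, ignoring the existing order entirely.
  static List<Student> mergeAndSort(List<Student> l1, List<Student> l2) {
    Map<String, Integer> totals = new HashMap<>();
    for (Student s : l1) totals.merge(s.name, s.score, Integer::sum);
    for (Student s : l2) totals.merge(s.name, s.score, Integer::sum);

    List<Student> merged = new ArrayList<>(totals.size());
    for (Map.Entry<String, Integer> e : totals.entrySet())
      merged.add(new Student(e.getKey(), e.getValue())); // assumed constructor
    merged.sort(Comparator.comparingInt((Student s) -> s.score).reversed());
    return merged;
  }
}

That final sort() call is the O((M + N) log (M + N)) step I'd like to improve on.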

The question though, is how do I efficiently re-sort such a list? The ordering of the existing lists clearly puts some constraints on the final position of elements in the merged list. For example, if a student is at position i in the first list and j in the second, he must appear among the first i + j students in the merged list by a simple argument analyzing the maximum number of students that could have a higher score. It's not immediately clear if this information would be useful in sorting the list, however.

You can assume that in many cases students that score highly in one list score highly in the other. The algorithm should work when that is not the case, but it gives you some additional information about the distribution that may be useful, in addition to the fact that the lists are already sorted.

It seems like this type of operation would be common for any type of distributed query + sorting implementation. For example, imagine a "select state, count(*) group by state" type of query issued against a distributed system (to count the number of records in each state) - naturally you'd get a sorted list of (state, count) objects back from each node, and then you'd want to merge and re-sort those during the reduce operation. It seems silly to throw away all the work already done on the distributed nodes.

Quantitative Notes

I'm interested in the case where the lists to be merged and re-sorted are small: usually around 256 entries. The range of scores varies, from 0 to 100 in some cases, up to about 0 to 10,000,000 in others. Of course, given the small number of elements, each operation will be fast in absolute time, even with naive algorithms - but performed billions of times, it adds up.

In fact, one of the answers below has proven that you can't, in general, do this better than a plain sort for increasing list sizes (i.e., taking n to be the combined list size) - but I'm actually more interested in doing this many times, for fixed size lists, with good empirical performance.

asked Jun 11 '16 by BeeOnRope

1 Answer

It sounds like you need to use an adaptive sort algorithm.

"A sorting algorithm falls into the adaptive sort family if it takes advantage of existing order in its input. It benefits from the presortedness in the input sequence – or a limited amount of disorder for various definitions of measures of disorder – and sorts faster. Adaptive sorting is usually performed by modifying existing sorting algorithms." - Wikipedia article linked above.

Examples include insertion sort and Timsort; see the article above for more. Note that in Java 8, the Arrays.sort(Object[]) library method uses a modified Timsort.


I am not aware of any published algorithm that deals with the specific requirements of your example, but here is an idea:

  1. Perform a classic merge on the two input lists L1 and L2:

    • When you merge a pair of matching objects and the merge changes the score that determines the ordering, put the merged object into temporary list A.
    • Otherwise put the object into temporary list B ... which will remain ordered.
  2. Sort the temporary list A.

  3. Merge lists A and B.

Assuming that:

  • the lengths of the original lists L1 & L2 are M & N respectively, and
  • the number of merged objects whose scores changed is R (which is at most min(M, N)),

then the overall complexity is O(M + N + R log R). If R is small relative to M + N, this should be an improvement; a rough sketch follows below.
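
Here is one way steps 1 to 3 could look in Java, under the assumptions that the lists are in descending score order (as in your example), that names are unique within each list, and that a Student(name, score) constructor exists. List B is built in merge order, so it stays sorted; only the R disturbed elements in A get an explicit sort.

import java.util.*;

class MergeSplitSort {
  static final Comparator<Student> BY_SCORE_DESC =
      Comparator.comparingInt((Student s) -> s.score).reversed();

  static List<Student> mergeResort(List<Student> l1, List<Student> l2) {
    // Index each list by name so matches can be detected in O(1).
    Map<String, Integer> scoreIn2 = new HashMap<>();
    for (Student s : l2) scoreIn2.put(s.name, s.score);
    Set<String> namesIn1 = new HashSet<>();
    for (Student s : l1) namesIn1.add(s.name);

    List<Student> a = new ArrayList<>(); // merged objects: order disturbed
    List<Student> b = new ArrayList<>(); // untouched objects: still descending

    // Step 1: classic two-way merge of l1 and l2.
    int i = 0, j = 0;
    while (i < l1.size() || j < l2.size()) {
      boolean from1 = j >= l2.size()
          || (i < l1.size() && l1.get(i).score >= l2.get(j).score);
      Student s = from1 ? l1.get(i++) : l2.get(j++);
      if (from1 && scoreIn2.containsKey(s.name)) {
        a.add(new Student(s.name, s.score + scoreIn2.get(s.name)));
      } else if (from1 || !namesIn1.contains(s.name)) {
        b.add(s); // appears in only one list, keeps its original score
      }
      // else: an l2 element whose l1 counterpart produces the merged object
    }

    // Step 2: sort only the R disturbed elements.
    a.sort(BY_SCORE_DESC);

    // Step 3: merge the two descending runs A and B.
    List<Student> out = new ArrayList<>(a.size() + b.size());
    int p = 0, q = 0;
    while (p < a.size() || q < b.size()) {
      boolean fromA = q >= b.size()
          || (p < a.size() && BY_SCORE_DESC.compare(a.get(p), b.get(q)) <= 0);
      out.add(fromA ? a.get(p++) : b.get(q++));
    }
    return out;
  }
}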


In your example, every case where there is a match between elements in the input lists is likely to move the merged element in the order. If it moves, it moves to later in the order and never earlier (taking the lists as sorted in ascending score order, since merging only adds to the score; the descending case is the mirror image). So another idea is to do a three-way merge between the original two lists and a priority queue. When you get a match, you merge the counts and add the result to the priority queue.

The complexity is similar to the previous approach, but you avoid the extra pass to merge the lists. Also, the R log R term becomes R log A, where A is the average size of the priority queue.


Keep in mind that I'm especially interested in the case where R is approximately equal to max(M,N), and also M == N.

(You didn't state that in your question! And, in fact it doesn't make any sense for R to be > min(M,N)!)

In that case, maybe just use the priority queue as an incremental sorter. Throw all merged records and all records that cannot be merged into the queue, and pull out records when they have a key / score that is less than the current heads of the two lists. Assuming that M and N are the list lengths, and A is the average priority queue size, the complexity is O(max(M,N) * log A). Whether this is an improvement on a simple re-sort will depend on whether the average A is significantly (in Big O terms) less than max(M,N). That will depend on the inputs ... and the merging function.
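
Here is a sketch of that incremental version, again flipped to the descending order of your example (an assumption on my part). Because merging only ever adds to a score, nothing still unseen can score more than the sum of the two current list heads, so in descending order the queue head can be emitted as soon as it reaches that bound - the mirror image of the "less than the current heads" test above. The same Student(name, score) constructor and unique-names-per-list assumptions apply.

import java.util.*;

class IncrementalPqMerge {
  static List<Student> merge(List<Student> l1, List<Student> l2) {
    Map<String, Integer> in1 = new HashMap<>(), in2 = new HashMap<>();
    for (Student s : l1) in1.put(s.name, s.score);
    for (Student s : l2) in2.put(s.name, s.score);

    // Max-heap by score: the head is the highest-scoring pending record.
    PriorityQueue<Student> pq = new PriorityQueue<>(
        Comparator.comparingInt((Student s) -> s.score).reversed());
    Set<String> alreadyMerged = new HashSet<>();
    List<Student> out = new ArrayList<>();

    int i = 0, j = 0;
    while (i < l1.size() || j < l2.size() || !pq.isEmpty()) {
      // Upper bound on the score of any record not yet in the queue.
      int bound;
      if (i < l1.size() && j < l2.size()) bound = l1.get(i).score + l2.get(j).score;
      else if (i < l1.size()) bound = l1.get(i).score;
      else if (j < l2.size()) bound = l2.get(j).score;
      else bound = Integer.MIN_VALUE;

      if (!pq.isEmpty() && pq.peek().score >= bound) {
        out.add(pq.poll()); // nothing still unseen can outrank it
        continue;
      }

      // Otherwise consume the larger of the two list heads into the queue.
      boolean from1 = j >= l2.size()
          || (i < l1.size() && l1.get(i).score >= l2.get(j).score);
      Student s = from1 ? l1.get(i++) : l2.get(j++);
      Map<String, Integer> other = from1 ? in2 : in1;
      Integer match = other.get(s.name);
      if (match == null) {
        pq.add(s); // appears in one list only
      } else if (alreadyMerged.add(s.name)) {
        pq.add(new Student(s.name, s.score + match)); // first copy seen: merge
      }
      // second copy of a merged pair: already accounted for, skip it
    }
    return out;
  }
}

Whether records leave the queue early (keeping A small) or pile up until the lists are nearly exhausted (degrading to an ordinary heap sort) depends on the score distribution, which is the caveat above about the inputs and the merging function.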


The number (N) varies, but 256 to 1,000 is typical. Perhaps as much as 10,000.

For lists of that typical size, you are down at a level where the complexity analysis is not going to be helpful. But also, you are down at a level where optimization becomes pointless ... unless you are doing the operation many, many times, or on a tight "time budget".


This is all very approximate, and my maths are "sketchy" at best.

A proper investigation would entail hundreds of hours to research, code, test, benchmark and analyze the various alternatives ... and we'd probably still get the answer that it depends on the input data set size and distribution.

answered Oct 17 '22 by Stephen C