
Sort a large collection while showing progress

What is the best way to sort a collection while updating a progress bar? Currently I have code like this:

for (int i = 0; i < items.size(); i++)
{
    progressBar.setValue(i);

    // Uses Collections.binarySearch:
    CollectionUtils.insertInOrder(sortedItems, items.get(i));
}

This shows progress, but the progress bar slows down as the number of items in sortedItems grows. Does anyone have a better approach? Ideally I'd like to use an interface similar to Collections.sort() so that I can try different sorting algorithms.

Any help would be great!



As a bit of background, this code is pulling back lots of documents (1-10 million) from Lucene and running a custom comparator over them. Sorting them by writing data back onto the disk would be far too slow to be practical. Most of the cost is reading the items off the disk and then running the comparator over them. My PC has loads of memory, so there are no issues relating to swapping to disk, etc.

In the end I went with Stephen's solution since it was very clean and allowed me to easily add a multi-threaded sorting algorithm.

Luke Quinane asked Oct 18 '10 01:10



1 Answer

You want to be careful here. You've chosen to use an algorithm that incrementally builds a sorted data structure so that (I take it) you can display a progress bar. However, in doing this, you may have chosen a sorting method that is significantly slower than the optimal sort. (Both sorts will need O(N log N) comparisons, but there's more to performance than big-O behaviour; for example, if sortedItems is an array-backed list such as ArrayList, every insert also has to shift the elements after the insertion point.)

If you are concerned that this might be an issue, compare the time to sort a typical collection using TreeMap and Collections.sort. The latter works by copying the input collection into an array, sorting the array and then copying it back. (It works best if the input collection is an ArrayList. If you don't need the result as a mutable collection, you can avoid the final copy back by using Collection.toArray, Arrays.sort and Arrays.asList instead.)
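To make the toArray / Arrays.sort / Arrays.asList route concrete, here is a minimal sketch. The class name ArraySortSketch and the method sortedView are made up for illustration; only the three JDK calls come from the paragraph above.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not part of the JDK or of the asker's code.
final class ArraySortSketch {

    // Copies the collection into an array, sorts the array with the given
    // comparator, and returns a fixed-size List view backed by that array.
    // There is no copy back into the original collection.
    @SuppressWarnings("unchecked")
    static <T> List<T> sortedView(Collection<? extends T> input, Comparator<? super T> cmp) {
        // toArray() returns Object[]; the unchecked cast is tolerable here
        // because the array is only ever exposed through the List view.
        T[] array = (T[]) input.toArray();
        Arrays.sort(array, cmp);
        return Arrays.asList(array);
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("pear", "apple", "fig");
        List<String> sorted = sortedView(items, Comparator.<String>naturalOrder());
        System.out.println(sorted); // prints [apple, fig, pear]
    }
}
```

The returned list supports set() but not add() or remove(), which is what "not a mutable collection" amounts to in practice.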

An alternative idea would be to use a Comparator object that keeps track of the number of times that it has been called, and use that to track the sort's progress. You can make use of the fact that the comparator is typically going to be called roughly N*log(N) times, though you may need to calibrate this against the actual algorithm used (see footnote 1).

Incidentally, counting the calls to the comparator will give you a better indication of progress than you get by counting insertions. You won't get the rate of progress appearing to slow down as you get closer to completing the sort.

(You'll have different threads reading and writing the counter, so you need to consider synchronization. Declaring the counter as volatile would work, at the cost of extra memory traffic. You could also just ignore the issue if you are happy for the progress bar to sometimes show stale values ... depending on your platform, etc.)
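A rough sketch of the counting-comparator idea follows. CountingComparator, estimateTotal and percent are hypothetical names, and the N * ceil(log2 N) estimate is an uncalibrated assumption, not a property of any particular sort. I've used an AtomicLong rather than a volatile field here, on the assumption that a multi-threaded sort may call the comparator from several threads at once; with a single sorting thread a volatile counter would do.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical wrapper: delegates to the real comparator and counts the calls.
final class CountingComparator<T> implements Comparator<T> {
    private final Comparator<? super T> delegate;
    private final AtomicLong calls = new AtomicLong();

    CountingComparator(Comparator<? super T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public int compare(T a, T b) {
        calls.incrementAndGet();          // safe even if several threads compare at once
        return delegate.compare(a, b);
    }

    long calls() {
        return calls.get();
    }

    // Assumed estimate of the total number of comparisons: N * ceil(log2 N).
    // This needs calibrating against the actual sort algorithm.
    static long estimateTotal(int n) {
        if (n < 2) {
            return 1;                     // avoids dividing by zero in percent()
        }
        int ceilLog2 = 32 - Integer.numberOfLeadingZeros(n - 1);
        return (long) n * ceilLog2;
    }

    // Progress as a percentage of the estimate, clamped to at most 100.
    int percent(int n) {
        return (int) Math.min(100L, calls.get() * 100 / estimateTotal(n));
    }
}

final class SortProgressDemo {
    public static void main(String[] args) throws InterruptedException {
        List<Integer> items = new ArrayList<>();
        Random random = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            items.add(random.nextInt());
        }
        final int n = items.size();

        CountingComparator<Integer> cmp =
                new CountingComparator<>(Comparator.<Integer>naturalOrder());

        // Sort on a worker thread and poll the counter from this one.
        Thread sorter = new Thread(() -> Collections.sort(items, cmp));
        sorter.start();
        while (sorter.isAlive()) {
            System.out.println(cmp.percent(n) + "%");
            sorter.join(100);             // wait up to 100 ms between polls
        }
        System.out.println("done after " + cmp.calls() + " comparisons");
    }
}
```

In a Swing application the polling loop would presumably be a javax.swing.Timer that calls progressBar.setValue(cmp.percent(n)) on the event thread. Because the estimate is only approximate, the bar may stop short of 100% or sit at 100% for a while before the sort finishes.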


1 - There is a problem with this. There are some algorithms where the number of comparisons can vary drastically depending on the initial order of the data being sorted. For such an algorithm, there is no way to calibrate the counter that will work in "non-average" cases.
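To see how much the comparison count can depend on the initial order, a small experiment along these lines may help. ComparisonCountDemo and countComparisons are hypothetical names, and the exact numbers will depend on the JDK's sort implementation, so none are quoted here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Hypothetical experiment: counts comparator calls for different input orders.
final class ComparisonCountDemo {

    // Sorts a copy of the input and returns how many comparator calls it took.
    static long countComparisons(List<Integer> input) {
        long[] calls = {0};               // single-threaded here, so a plain counter is enough
        Comparator<Integer> counting = (a, b) -> {
            calls[0]++;
            return Integer.compare(a, b);
        };
        List<Integer> copy = new ArrayList<>(input);
        Collections.sort(copy, counting);
        return calls[0];
    }

    public static void main(String[] args) {
        List<Integer> sorted = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            sorted.add(i);
        }
        List<Integer> shuffled = new ArrayList<>(sorted);
        Collections.shuffle(shuffled, new Random(1));

        System.out.println("already sorted: " + countComparisons(sorted));
        System.out.println("shuffled:       " + countComparisons(shuffled));
    }
}
```

If the already-sorted input needs far fewer comparisons than the shuffled one, a progress bar calibrated on shuffled data would finish while still showing a low percentage on sorted data.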

Stephen C answered Nov 01 '22 07:11
