Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fixed-size collection that keeps top (N) values

My code processes a huge number of values and I'm looking for an efficient structure to keep track of the top (N) values, where N is less than 10, so collecting ALL numbers then sorting the list and taking the first (N) is probably not the most efficient way.

To do that, I'm building a collection of fixed size N, to keep the top (N) values sorted in descending order. The Add(T value) method of the sorted collection would add the value to the collection if value is higher than any of the existing values (in which case the last element is removed) or if the collection is not full.

I was able to implement what I wanted using a doubly LinkedList<T> since it has fast insertion and removal, but I was wondering if using SortedDictionary<TKey, TValue> or a priority queue would be better ?

Thank you.

like image 959
alhazen Avatar asked Aug 20 '10 15:08

alhazen


5 Answers

I would simply use a heap with a limited depth. I do not know whether there already exists a library for that, but it should be easy to implement.

like image 72
Svante Avatar answered Nov 01 '22 01:11

Svante


The main advantage to use a SortedDictionary or SortedList it is that you can skip the sorting intelligence because they handle it for you( e.g. You just have to remove the (n + 1)th element every time you add a value). But on the other hands adopt that sort of complex structure for 10 elements resembles to use a nuke to kill a fly...

Maybe the linked list is a good way, and also a simple linear comparison for inserting values in order is not so slower than binary search (we still speak about max 10 comparisons against ~3, current CPUs not event feel the difference).

EDIT:

fixed arrays can be used to build prioriry queues with binary heaps, that probably is the right way to implement this

like image 20
digEmAll Avatar answered Nov 01 '22 03:11

digEmAll


The performance may really change.

For N < 10 any overly complex data structure will likely drag performance significantly (though perhaps not catastrophically) so I'd use an array to store the items.

Then there are 3 main possibilities on how to arrange the items in the array:

  1. sorted is probably the best choice to keep things simple:
    • constant time to determine whether to insert a new item (compare with lowest)
    • O(N) time to insert - but this only happens for items that are in the N best-so-far. And if your input is sufficiently random, the average time will be even lower because most insertions will only move a few elements at the bottom of the top.
  2. unsorted:
    • O(N) time for each input element, that's too much compared to "sorted"
  3. binary heap that implements a priority queue: more complex to implement but maybe even faster than "sorted"
    • constant time to determine whether to insert a new item (compare with lowest)
    • O(log N) time to insert - and this only happens for items that are in the N best-so-far
like image 23
user192472 Avatar answered Nov 01 '22 02:11

user192472


For such a small number, just keep an array. Scan the array keeping track of the smallest value and its position. If your new number is larger than the smallest on in the set, replace it. You should of course scan for the lowest value once after you insert a number, then just compare new numbers to that and only take action if you have something larger (replace and rescan).

like image 3
phkahler Avatar answered Nov 01 '22 03:11

phkahler


Unless you have a solid reason to do otherwise, I'd use a priority queue.

There is one trick that can simplify the logic quite a bit. Most people's first idea is to look at each incoming item, and insert it into the collection iff the collection contains fewer items than desired, or the new item is larger than the smallest item currently in the collection.

You can simplify things quite a bit if you leave room for one extra item in the collection. Always insert each incoming item into the collection, and then if the collection is too large, remove the smallest item.

While a priority queue is arguably overkill for only 10 items, it keeps the logic simple, and is efficient both in terms of space and time, so if you ever need N=10000 (or whatever) it'll still work nicely.

like image 2
Jerry Coffin Avatar answered Nov 01 '22 01:11

Jerry Coffin