I need a data structure that always holds the n
largest items inserted so far (in no particular order).
So, if n
is 3, we could have the following session where I insert a few numbers and the content of the container changes:
[] // now insert 1
[1] // now insert 0
[1,0] // now insert 4
[1,0,4] // now insert 3
[1,4,3] // now insert 0
[1,4,3] // now insert 3
[4,3,3]
You get the idea. What's the name of the data structure? What's the best way to implement this? Or is this in some library?
I am thinking to use a container that has a priority_queue
for its elements (delegation), which uses the reverse comparison, so pop
will remove the smallest element. So the insert
function first checks if the new element to be inserted is greater than the smallest. If so, we throw that smallest out and push the new element.
(I have a C++
implementation in mind, but the question is language-agnostic nevertheless.)
The specific datastructure you want is probably the implicit heap. The raw datastructure is just an array; for convenience, say that it is N=2^n elements in size, and that you want to maintain the largest N-1 elements.
The idea is to treat the array (call it A) as a complete binary tree of depth n:
To maintain the tree as a "heap", you need to ensure that each node is smaller than (or equal to) its children. This is called the "heap condition":
To use the heap to maintain the largest N elements:
Basically, this will cause any replacement element to "filter up" the tree until it achieves its natural place. This will take at most n=log2(N) steps, which is as good as you can get. Also, the implicit form of the tree allows a very fast implementation; existing bounded-priority-queue libraries will most likely use an implicit heap.
In Pyhton, use heapq. Create a small wrapper around it, something like this:
class TopN_Queue:
def __init__(self, n):
self.max_sz = n
self.data = []
def add(self, x):
if len(self.data) == self.max_sz:
heapq.heappushpop(self.data, x)
else:
heapq.heappush(self.data, x)
...
In Java you can use a SortedSet implemented e.g. by a TreeSet. After each insertion check if the set is too large, if yes remove the last element.
This is reasonably efficient, I have used it successfully for solving several Project Euler problems.
A priority_queue is the closest thing in C++ with STL. You could wrap it in another class to create your own implementation that trims the size automatically.
Language-agnostically (although maybe not memory-fragmentation-safely):
std::priority_queue does step 2 for you.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With