 

Data structure for efficiently returning the top-K entries of a hash table (map, dictionary)

Here's a description:

It operates like a regular map with get, put, and remove methods, but has a getTopKEntries(int k) method to get the top-K elements, sorted by the key:
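Roughly, the interface I have in mind looks like this (a C++ sketch; the class name and the use of std::optional are just placeholders, and the int32_t types reflect the note below that keys and values are 4-byte integers):

    // Rough sketch of the intended interface; not a real API.
    #include <cstdint>
    #include <optional>
    #include <utility>
    #include <vector>

    class TopKMap {
    public:
        std::optional<int32_t> get(int32_t key) const;   // hoped-for O(1)
        void put(int32_t key, int32_t value);            // hoped-for O(1)
        void remove(int32_t key);                        // hoped-for O(1)

        // The entries with the K largest keys, sorted by key.
        std::vector<std::pair<int32_t, int32_t>> getTopKEntries(int k) const;
    };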

For my specific use case, I'm adding, removing, and adjusting a lot of values in the structure, but at any one time there are approximately 500-1000 elements; I want to return the entries for the top 10 keys efficiently.

  • I call the put and remove methods many times.
  • I call the getTopKEntries method.
  • I call the put and remove methods some more times.
  • I call the getTopKEntries method.
  • ...

I'm hoping for O(1) get, put, and remove operations, and for getTopKEntries to be dependent only on K, not on the size of the map.

So what's a data structure for efficiently returning the top-K elements of a map?

My other question is similar, but is for the case of returning all elements of a map, sorted by the key.

If it helps, both the keys and values are 4-byte integers.

asked Jan 20 '10 by Rudiger


2 Answers

A binary search tree (i.e. std::map in C++) sounds like the perfect structure: it’s already lexicographically ordered, i.e. a simple in-order traversal will yield the elements in ascending order. Hence, iterating over the first k elements will yield the top k elements directly.
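A minimal sketch of that, assuming "top" means the largest keys so the traversal simply runs in reverse; nothing here beyond the standard std::map API:

    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <utility>
    #include <vector>

    // The tree keeps keys sorted, so the top-K entries are just the first K
    // entries of a (reverse) in-order traversal: O(k) per query, at the price
    // of O(log n) rather than O(1) for put/remove.
    std::vector<std::pair<int32_t, int32_t>>
    topKEntries(const std::map<int32_t, int32_t>& m, int k) {
        std::vector<std::pair<int32_t, int32_t>> out;
        for (auto it = m.rbegin(); it != m.rend() && k > 0; ++it, --k)
            out.emplace_back(it->first, it->second);
        return out;
    }

    int main() {
        std::map<int32_t, int32_t> m{{5, 50}, {1, 10}, {9, 90}, {3, 30}};
        for (const auto& [key, value] : topKEntries(m, 2))
            std::cout << key << " -> " << value << '\n';   // 9 -> 90, then 5 -> 50
    }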

Additionally, since you foresee a lot of “remove” operations, a hash table won’t be well-suited anyway: remove operations destroy the load factor characteristics of hash tables, which leads to a rapid deterioration of the runtime.

Konrad Rudolph


I'm not sure I fully accept Konrad's view that lots of remove operations would destroy the structure of a hash table.

Without remove operations, you could keep all the objects in a hash table, and keep the top K in a priority heap that's incrementally updated. This would make insert O(1 + log K), i.e. constant-time in N assuming K is constant and not dependent on N (N = number of objects in the table). However, this doesn't work once you have the remove operation available. The proposed Fibonacci heap has an O(log N) amortized delete operation, so it doesn't give a good solution either: all the objects would need to be kept in the heap, and if you eventually remove every object that you insert, you get O(log N) behavior per insert+delete pair in general.
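To make the insert-only case concrete, here is a rough sketch (assuming "top" means the largest keys, k >= 1, and ignoring re-puts of an existing key beyond updating its value):

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <queue>
    #include <unordered_map>
    #include <vector>

    // Insert-only sketch: all objects live in a hash table, and the keys of
    // the current top K sit in a min-heap whose root is the smallest of them.
    // put is O(1) expected for the table plus O(log K) for the heap; there is
    // no remove, which is exactly the limitation discussed above.
    struct InsertOnlyTopK {
        std::size_t k;
        std::unordered_map<int32_t, int32_t> table;
        std::priority_queue<int32_t, std::vector<int32_t>,
                            std::greater<int32_t>> topKeys;   // min-heap of keys

        void put(int32_t key, int32_t value) {
            if (table.count(key)) { table[key] = value; return; }  // re-put: update value only
            table[key] = value;
            if (topKeys.size() < k) {
                topKeys.push(key);                 // heap not full yet
            } else if (key > topKeys.top()) {
                topKeys.pop();                     // evict the smallest of the top K
                topKeys.push(key);                 // O(log K)
            }
        }
    };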

I would maybe try the following approach:

Store the objects in a hash table, assuming you need the whole table for purposes other than returning the top objects, and maintain a priority heap (a standard binary heap will do) that contains K * C objects, where C is a factor whose value you need to determine experimentally. Then:

  • Whenever you add a new object, try to insert it into the heap: if it fits in the K*C space (the heap is not at capacity yet, or it pushes another object out), insert it and set a bit in the hash table to denote that the object is in the heap; when an object is pushed out of the heap, clear its bit.
  • When you remove an object, check the bit: if it is set, i.e. the object is in the heap, remove it from there as well (you need to search for it unless you keep a pointer to it from the hash table; it's best to maintain the pointer). The heap shrinks as a result.
  • The key point is that as long as the heap still has at least K objects, it is guaranteed to contain all of the top K objects. This is where the factor C comes in: it provides the "leeway" for the heap to shrink.
  • When the heap size drops below K, run a linear scan over the whole hash table and fill the heap back up to K * C capacity.

Setting C is empirical because it depends on how your objects come and go, but tuning it should be easy based on runtime profiling.
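A rough sketch of the whole scheme, with one substitution worth naming: the K*C candidate pool is a std::set rather than a heap with back-pointers from the hash table, since deleting an arbitrary key is simpler there at the same O(log(KC)) cost, and set membership stands in for the bit. The class and member names are mine, not anything standard.

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <set>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    class TopKMap {
    public:
        TopKMap(std::size_t k, std::size_t c) : k_(k), cap_(k * c) {}

        std::optional<int32_t> get(int32_t key) const {
            auto it = table_.find(key);
            if (it == table_.end()) return std::nullopt;
            return it->second;
        }

        void put(int32_t key, int32_t value) {
            table_[key] = value;
            if (pool_.count(key)) return;              // already a candidate
            if (pool_.size() < cap_) {
                pool_.insert(key);                     // pool not at capacity yet
            } else if (key > *pool_.begin()) {
                pool_.erase(pool_.begin());            // push the smallest candidate out
                pool_.insert(key);
            }
        }

        void remove(int32_t key) {
            table_.erase(key);
            pool_.erase(key);                          // no-op if key wasn't a candidate
            if (pool_.size() < k_) rebuild();          // pool shrank below K: rescan
        }

        // Entries with the k largest keys, sorted by key (descending).
        std::vector<std::pair<int32_t, int32_t>> getTopKEntries(std::size_t k) const {
            std::vector<std::pair<int32_t, int32_t>> out;
            for (auto it = pool_.rbegin(); it != pool_.rend() && out.size() < k; ++it)
                out.emplace_back(*it, table_.at(*it));
            return out;
        }

    private:
        void rebuild() {                               // linear scan, refills to K*C
            pool_.clear();
            for (const auto& kv : table_) {
                int32_t key = kv.first;
                if (pool_.size() < cap_) {
                    pool_.insert(key);
                } else if (key > *pool_.begin()) {
                    pool_.erase(pool_.begin());
                    pool_.insert(key);
                }
            }
        }

        std::size_t k_, cap_;
        std::unordered_map<int32_t, int32_t> table_;   // all objects live here
        std::set<int32_t> pool_;                       // candidate keys, size <= K*C
    };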

Complexity: insert is O(1 + log(KC)). Remove is O(1 + p log(KC) + q N), where p is the probability that a removed object was in the heap and q is the probability that the heap needs to be rebuilt. p depends on the characteristics of how objects come and go; for a simple analysis we can set p = KC/N, i.e. assume uniform probability. q is even more sensitive to the "flow" of the objects: for example, if new objects generally increase in value over time and you always delete older objects, q tends towards zero.

Note that, funnily enough, p is inversely proportional to N, so this part actually speeds up as N grows :)

Antti Huima