Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find top k visiting URL for last day, or last hour, or last minute?

The original question is given file containing 5GB URL being visited last day, find the top k frequent URL. The problem can be solved by using hash map to count the occurrences of distinct URL and find top k with the help of min heap, taking a O(n log k) time.

Now I'm thinking what if the input was unlimited online data stream (instead of static file), then how can I know the top k URL of the last day?

Or is there any improvement that I can made to the system that allow me to get top k URL for last minute and last day and last hours dynamically?

Any hint will be appreciated!!

like image 308
user1941469 Avatar asked Jan 02 '13 05:01

user1941469


1 Answers

If you are willing to settle for a probabilistic answer that might contain a few wrong entries, you should definitely look into the count-min sketch data structure. It was specifically designed to estimate frequent elements in a stream using as little memory as possible, and most implementations support a very time and space efficient approximation of the top k elements out of a stream. Moreover, the structure lets you tune space usage, which makes it ideal for situations like these. IIRC Google uses this to determine their most frequent search queries.

There are several implementations of this data structure available online.

Hope this helps!

like image 117
templatetypedef Avatar answered Oct 23 '22 06:10

templatetypedef