The original question: given a file containing 5 GB of URLs visited in the last day, find the top k most frequent URLs. The problem can be solved by using a hash map to count the occurrences of each distinct URL, then finding the top k with the help of a min-heap, taking O(n log k) time overall.
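For concreteness, here is a small sketch of that approach (the function name and sample data are just for illustration): count with a hash map, then keep only the k best candidates in a min-heap of size k.

```python
import heapq
from collections import Counter

def top_k_urls(url_iter, k):
    """Hash-map counting (O(n)) followed by a size-k min-heap
    scan over the distinct URLs (O(m log k), m = distinct count)."""
    counts = Counter(url_iter)  # hash map: URL -> frequency
    heap = []                   # min-heap of (count, url), size <= k
    for url, c in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (c, url))
        elif c > heap[0][0]:
            # New candidate beats the current k-th place: swap it in.
            heapq.heapreplace(heap, (c, url))
    # Most frequent first
    return [url for c, url in sorted(heap, reverse=True)]

urls = ["a.com", "b.com", "a.com", "c.com", "a.com", "b.com"]
print(top_k_urls(urls, 2))  # ['a.com', 'b.com']
```

The heap never grows beyond k entries, which is why the memory for the selection step stays O(k) even though the hash map itself is O(m).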
Now I'm wondering: what if the input were an unbounded online data stream instead of a static file? How could I find the top k URLs of the last day then?
Or is there any improvement I could make to the system that would let me query the top k URLs for the last minute, the last hour, and the last day dynamically?
Any hint would be appreciated!
If you are willing to settle for a probabilistic answer that might contain a few wrong entries, you should definitely look into the count-min sketch data structure. It was specifically designed to estimate the frequencies of elements in a stream using as little memory as possible, and most implementations support a very time- and space-efficient approximation of the top k elements of a stream. Moreover, the structure lets you trade space usage against accuracy, which makes it ideal for situations like this. IIRC Google uses it to determine its most frequent search queries.
There are several implementations of this data structure available online.
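To make the idea concrete, here is a minimal, illustrative count-min sketch (the class name, hash choice, and default dimensions are my own assumptions, not from any particular library): d hash rows of width w, where every update increments one counter per row and an estimate takes the minimum across rows, so estimates can only over-count, never under-count.

```python
import hashlib

class CountMinSketch:
    """Minimal count-min sketch: depth hash rows of the given width.
    Estimates are always >= the true count; wider/deeper tables
    trade memory for accuracy."""
    def __init__(self, width=2000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One independent-ish hash per row, derived by salting blake2b.
        for i in range(self.depth):
            h = hashlib.blake2b(item.encode(), salt=bytes([i]) * 8).digest()
            yield int.from_bytes(h[:8], "big") % self.width

    def add(self, item):
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += 1

    def estimate(self, item):
        # Minimum across rows filters out most hash-collision noise.
        return min(self.table[row][col]
                   for row, col in enumerate(self._columns(item)))

cms = CountMinSketch()
stream = ["a.com"] * 50 + ["b.com"] * 10 + ["c.com"]
for url in stream:
    cms.add(url)
print(cms.estimate("a.com"))  # >= 50; equals 50 unless collisions inflate it
```

In a real top-k tracker you would pair a sketch like this with a small heap of candidate heavy hitters, and for "last minute / last hour / last day" queries you could keep one sketch per time bucket and sum or expire buckets as the window slides.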
Hope this helps!