The original question: given a file containing 5 GB of URLs visited in the last day, find the top k most frequent URLs. The problem can be solved by using a hash map to count the occurrences of each distinct URL, then finding the top k with the help of a min-heap, taking O(n log k) time overall.
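For concreteness, here is a small sketch of that approach (the function name and sample data are just for illustration): count with a hash map, then keep only the k best candidates in a min-heap of size k.

```python
import heapq
from collections import Counter

def top_k_urls(url_iter, k):
    """Hash-map counting (O(n)) followed by a size-k min-heap
    scan over the distinct URLs (O(m log k), m = distinct count)."""
    counts = Counter(url_iter)  # hash map: URL -> frequency
    heap = []                   # min-heap of (count, url), size <= k
    for url, c in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (c, url))
        elif c > heap[0][0]:
            # New candidate beats the current k-th place: swap it in.
            heapq.heapreplace(heap, (c, url))
    # Most frequent first
    return [url for c, url in sorted(heap, reverse=True)]

urls = ["a.com", "b.com", "a.com", "c.com", "a.com", "b.com"]
print(top_k_urls(urls, 2))  # ['a.com', 'b.com']
```

The heap never grows beyond k entries, which is why the memory for the selection step stays O(k) even though the hash map itself is O(m).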
Now I'm wondering: what if the input were an unbounded online data stream instead of a static file? How could I find the top k URLs of the last day then?
Or is there any improvement I could make to the system that would let me query the top k URLs for the last minute, the last hour, and the last day dynamically?
Any hint would be appreciated!
If you are willing to settle for a probabilistic answer that might contain a few wrong entries, you should definitely look into the count-min sketch data structure. It was specifically designed to estimate the frequencies of elements in a stream using as little memory as possible, and most implementations support a very time- and space-efficient approximation of the top k elements of a stream. Moreover, the structure lets you trade space usage against accuracy, which makes it ideal for situations like this. IIRC Google uses it to determine its most frequent search queries.
There are several implementations of this data structure available online.
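To make the idea concrete, here is a minimal, illustrative count-min sketch (the class name, hash choice, and default dimensions are my own assumptions, not from any particular library): d hash rows of width w, where every update increments one counter per row and an estimate takes the minimum across rows, so estimates can only over-count, never under-count.

```python
import hashlib

class CountMinSketch:
    """Minimal count-min sketch: depth hash rows of the given width.
    Estimates are always >= the true count; wider/deeper tables
    trade memory for accuracy."""
    def __init__(self, width=2000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One independent-ish hash per row, derived by salting blake2b.
        for i in range(self.depth):
            h = hashlib.blake2b(item.encode(), salt=bytes([i]) * 8).digest()
            yield int.from_bytes(h[:8], "big") % self.width

    def add(self, item):
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += 1

    def estimate(self, item):
        # Minimum across rows filters out most hash-collision noise.
        return min(self.table[row][col]
                   for row, col in enumerate(self._columns(item)))

cms = CountMinSketch()
stream = ["a.com"] * 50 + ["b.com"] * 10 + ["c.com"]
for url in stream:
    cms.add(url)
print(cms.estimate("a.com"))  # >= 50; equals 50 unless collisions inflate it
```

In a real top-k tracker you would pair a sketch like this with a small heap of candidate heavy hitters, and for "last minute / last hour / last day" queries you could keep one sketch per time bucket and sum or expire buckets as the window slides.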
Hope this helps!