One of my friends was asked the following question in an interview. Can anyone tell me how to solve it?
We have a fairly large log file, about 5 GB. Each line of the log file contains a URL that a user has visited on our site. We want to figure out the 100 most popular URLs visited by our users. How should we do it?
If we have plenty of RAM (say, more than 10 GB), just do it straightforwardly with a hashmap: one pass to count each URL, then take the 100 with the highest counts.
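A minimal sketch of the in-memory version, assuming a hypothetical file `access.log` with one URL per line:

```python
from collections import Counter

def top_urls_in_memory(path, n=100):
    counts = Counter()
    with open(path) as f:
        for line in f:
            url = line.strip()
            if url:
                counts[url] += 1
    # most_common(n) keeps only the n largest counts, so this step is
    # O(U log n) for U distinct URLs.
    return counts.most_common(n)

# Hypothetical usage:
# print(top_urls_in_memory("access.log"))
```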
Otherwise, split it into several smaller files using a hash of the URL, so that every occurrence of a given URL lands in the same file. Then process each file independently and record its top 100. From the per-file top-100 lists it is easy to get the overall top 100, as sketched below.
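A rough sketch of the partitioning idea, with hypothetical chunk file names and an arbitrary chunk count of 64. Because the partition key is a hash of the URL, each chunk holds the complete count for every URL assigned to it, so merging the per-chunk top lists is safe:

```python
import hashlib
from collections import Counter
from heapq import nlargest

def partition_log(path, num_chunks=64):
    # Spread lines across chunk files by a stable hash of the URL.
    chunks = [open(f"chunk_{i}.txt", "w") for i in range(num_chunks)]
    try:
        with open(path) as f:
            for line in f:
                url = line.strip()
                if not url:
                    continue
                i = int(hashlib.md5(url.encode()).hexdigest(), 16) % num_chunks
                chunks[i].write(url + "\n")
    finally:
        for c in chunks:
            c.close()
    return [f"chunk_{i}.txt" for i in range(num_chunks)]

def top_urls_partitioned(path, n=100, num_chunks=64):
    candidates = []
    for chunk in partition_log(path, num_chunks):
        counts = Counter()
        with open(chunk) as f:
            for line in f:
                counts[line.rstrip("\n")] += 1
        candidates.extend(counts.most_common(n))  # per-chunk top n
    # The global top n must appear in some chunk's top n, so merging works.
    return nlargest(n, candidates, key=lambda pair: pair[1])
```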
Another solution is to sort the file with any external sorting method, then scan it once to count how many times each URL appears. Because equal URLs are now adjacent, you only need the count of the current run plus a running top-100 list; anything that doesn't make the top 100 can safely be thrown away.
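A sketch of that scan, assuming the log has already been externally sorted into a hypothetical file `sorted.log`. Only the current run's count and a size-100 min-heap stay in memory:

```python
import heapq
from itertools import groupby

def top_urls_from_sorted(path, n=100):
    heap = []  # min-heap of (count, url); never holds more than n entries
    with open(path) as f:
        lines = (line.rstrip("\n") for line in f)
        # groupby collapses each run of identical adjacent URLs.
        for url, run in groupby(lines):
            count = sum(1 for _ in run)
            if len(heap) < n:
                heapq.heappush(heap, (count, url))
            elif count > heap[0][0]:
                heapq.heapreplace(heap, (count, url))
    # Returns (count, url) pairs, highest counts first.
    return sorted(heap, reverse=True)
```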
Just sort the log file by URL (this needs little extra space if you choose an in-place algorithm like heap sort, or quicksort with its O(log n) stack) and then count, for each URL, how many times it appears. That part is easy, because lines with the same URL are next to each other.
Overall complexity is O(n log n).
Why splitting the file into many chunks arbitrarily (for example, by line ranges rather than by a hash of the URL) and keeping only the top 3 (or top 5, or top N) of each chunk is wrong:
        File1  File2  File3  Total
url1      5      0      5     10
url2      0      5      5     10
url3      5      5      0     10
url4      5      0      0      5
url5      0      5      0      5
url6      0      0      5      5
url7      4      4      4     12
url7 never makes it into the top 3 of any individual file, yet it is the best URL overall (12 hits). Note that this counterexample only applies to arbitrary splits; if the file is partitioned by a hash of the URL, all occurrences of url7 land in the same file, so the per-file top-N lists are safe to merge.