Getting the top 100 URLs from a log file

One of my friends was asked the following question in an interview. Can anyone tell me how to solve it?

We have a fairly large log file, about 5 GB. Each line of the log file contains a URL that a user has visited on our site. We want to figure out what the 100 most popular URLs visited by our users are. How would you do it?

asked May 24 '12 by Animesh Porwal

2 Answers

If we have enough RAM (say, more than 10 GB), just do it the straightforward way: build a hash map from URL to count in memory, then take the 100 entries with the largest counts.
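For illustration, a minimal Python sketch of this in-memory approach, assuming one URL per line; the file name access.log is a placeholder:

    from collections import Counter

    def top_urls(path, k=100):
        counts = Counter()
        with open(path) as f:
            for line in f:
                counts[line.strip()] += 1  # assumes one URL per line
        return counts.most_common(k)       # (url, count) pairs, largest first

    print(top_urls("access.log"))

Note that most_common(k) keeps a size-k heap internally, so extracting the top 100 is cheaper than fully sorting all the counts.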

Otherwise, partition the log into several files using a hash function on the URL, so that every occurrence of a given URL lands in the same file. Then process each file on its own and record its top 100. Because no URL is split across files, merging the per-file top-100 lists yields the overall top 100; a sketch follows below.
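A sketch of that two-phase approach, again in Python. The bucket count of 64 and the bucket_*.txt paths are illustrative choices, and a stable hash (here MD5) is used so the partition does not depend on Python's per-process hash randomization:

    import heapq
    from collections import Counter
    from hashlib import md5

    NUM_BUCKETS = 64  # chosen so each bucket's count table fits in RAM

    def partition(path):
        buckets = [open(f"bucket_{i}.txt", "w") for i in range(NUM_BUCKETS)]
        with open(path) as f:
            for line in f:
                url = line.strip()
                i = int(md5(url.encode()).hexdigest(), 16) % NUM_BUCKETS
                buckets[i].write(url + "\n")   # same URL -> same bucket
        for b in buckets:
            b.close()

    def top_100(path, k=100):
        partition(path)
        candidates = []
        for i in range(NUM_BUCKETS):
            with open(f"bucket_{i}.txt") as f:
                counts = Counter(line.strip() for line in f)
            candidates.extend(counts.most_common(k))  # exact counts per bucket
        return heapq.nlargest(k, candidates, key=lambda kv: kv[1])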

Another solution is to sort the file with any external sorting method and then scan it once, counting each URL's occurrences (after sorting, identical URLs sit on adjacent lines). During the scan you only need to maintain a running top 100, not all of the counts: anything that cannot make the top 100 can be thrown away immediately.
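A sketch of that sort-then-scan idea, delegating the external sort to the Unix sort utility (assumed to be on PATH; it spills to temporary files on its own) and keeping a size-100 min-heap for the running top 100:

    import heapq
    import subprocess

    def top_100_sorted(path, k=100):
        proc = subprocess.Popen(["sort", path], stdout=subprocess.PIPE, text=True)
        heap = []                      # (count, url), smallest count on top
        current, count = None, 0

        def offer(url, n):             # keep at most k candidates
            if len(heap) < k:
                heapq.heappush(heap, (n, url))
            elif n > heap[0][0]:
                heapq.heapreplace(heap, (n, url))

        for line in proc.stdout:       # identical URLs arrive consecutively
            url = line.strip()
            if url == current:
                count += 1
            else:
                if current is not None:
                    offer(current, count)
                current, count = url, 1
        if current is not None:
            offer(current, count)      # flush the final run
        proc.wait()
        return sorted(heap, reverse=True)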

answered by Haozhun

Just sort the log file by URL (with constant extra space if you choose an in-place algorithm like heapsort; in practice a 5 GB file calls for an external sort) and then count, for each URL, how many times it appears. That part is easy, because lines with the same URL end up next to each other.

Overall complexity is O(n log n).
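Once the file is sorted, the counting pass reduces to walking contiguous runs of identical lines; a short sketch, assuming the sorted output has been written to a placeholder file sorted.log:

    import heapq
    from itertools import groupby

    def top_100_from_sorted(path, k=100):
        with open(path) as f:
            # groupby collapses each run of identical adjacent URLs
            runs = ((url, sum(1 for _ in grp))
                    for url, grp in groupby(line.strip() for line in f))
            # nlargest keeps only k candidates in memory at a time
            return heapq.nlargest(k, runs, key=lambda kv: kv[1])

    print(top_100_from_sorted("sorted.log"))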

Why splitting into many files and keeping only the top 3 (or top 5, or top N) per file is wrong when the split is arbitrary (this objection does not apply to hash-partitioning by URL, as in the first answer, since there every occurrence of a URL lands in the same file):

     File1 File2 File3
url1   5     0     5
url2   0     5     5
url3   5     5     0
url4   5     0     0
url5   0     5     0
url6   0     0     5
url7   4     4     4

url7 never makes it into the top 3 of any individual file, yet with a total of 12 (versus 10 for url1, url2, and url3) it is the best overall.

answered by Thomash