One of my friends was asked the following question in an interview. Can anyone tell me how to solve it?
We have a fairly large log file, about 5 GB. Each line of the log file contains a URL that a user has visited on our site. We want to figure out the 100 most popular URLs visited by our users. How should we do it?
If we have plenty of RAM (say, more than 10 GB), just do it straightforwardly with a hashmap: one pass to count each URL, then take the 100 with the highest counts.
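A minimal sketch of the in-memory version, assuming a hypothetical file `access.log` with one URL per line:

```python
from collections import Counter

def top_urls_in_memory(path, n=100):
    counts = Counter()
    with open(path) as f:
        for line in f:
            url = line.strip()
            if url:
                counts[url] += 1
    # most_common(n) keeps only the n largest counts, so this step is
    # O(U log n) for U distinct URLs.
    return counts.most_common(n)

# Hypothetical usage:
# print(top_urls_in_memory("access.log"))
```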
Otherwise, split it into several smaller files using a hash of the URL, so that every occurrence of a given URL lands in the same file. Then process each file independently and record its top 100. From the per-file top-100 lists it is easy to get the overall top 100, as sketched below.
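A rough sketch of the partitioning idea, with hypothetical chunk file names and an arbitrary chunk count of 64. Because the partition key is a hash of the URL, each chunk holds the complete count for every URL assigned to it, so merging the per-chunk top lists is safe:

```python
import hashlib
from collections import Counter
from heapq import nlargest

def partition_log(path, num_chunks=64):
    # Spread lines across chunk files by a stable hash of the URL.
    chunks = [open(f"chunk_{i}.txt", "w") for i in range(num_chunks)]
    try:
        with open(path) as f:
            for line in f:
                url = line.strip()
                if not url:
                    continue
                i = int(hashlib.md5(url.encode()).hexdigest(), 16) % num_chunks
                chunks[i].write(url + "\n")
    finally:
        for c in chunks:
            c.close()
    return [f"chunk_{i}.txt" for i in range(num_chunks)]

def top_urls_partitioned(path, n=100, num_chunks=64):
    candidates = []
    for chunk in partition_log(path, num_chunks):
        counts = Counter()
        with open(chunk) as f:
            for line in f:
                counts[line.rstrip("\n")] += 1
        candidates.extend(counts.most_common(n))  # per-chunk top n
    # The global top n must appear in some chunk's top n, so merging works.
    return nlargest(n, candidates, key=lambda pair: pair[1])
```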
Another solution is to sort the file with any external sorting method, then scan it once to count how many times each URL appears. Because equal URLs are now adjacent, you only need the count of the current run plus a running top-100 list; anything that doesn't make the top 100 can safely be thrown away.
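A sketch of that scan, assuming the log has already been externally sorted into a hypothetical file `sorted.log`. Only the current run's count and a size-100 min-heap stay in memory:

```python
import heapq
from itertools import groupby

def top_urls_from_sorted(path, n=100):
    heap = []  # min-heap of (count, url); never holds more than n entries
    with open(path) as f:
        lines = (line.rstrip("\n") for line in f)
        # groupby collapses each run of identical adjacent URLs.
        for url, run in groupby(lines):
            count = sum(1 for _ in run)
            if len(heap) < n:
                heapq.heappush(heap, (count, url))
            elif count > heap[0][0]:
                heapq.heapreplace(heap, (count, url))
    # Returns (count, url) pairs, highest counts first.
    return sorted(heap, reverse=True)
```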
Just sort the log file by URL (this needs little extra space if you choose an in-place algorithm like heap sort, or quicksort with its O(log n) stack) and then count, for each URL, how many times it appears. That part is easy, because lines with the same URL are next to each other.
Overall complexity is O(n log n).
Why splitting the file into many chunks arbitrarily (for example, by line ranges rather than by a hash of the URL) and keeping only the top 3 (or top 5, or top N) of each chunk is wrong:
        File1  File2  File3  Total
url1      5      0      5     10
url2      0      5      5     10
url3      5      5      0     10
url4      5      0      0      5
url5      0      5      0      5
url6      0      0      5      5
url7      4      4      4     12
url7 never makes it into the top 3 of any individual file, yet it is the best URL overall (12 hits). Note that this counterexample only applies to arbitrary splits; if the file is partitioned by a hash of the URL, all occurrences of url7 land in the same file, so the per-file top-N lists are safe to merge.