
Finding the most common three-item sequence in a very large file

I have many log files of webpage visits, where each visit is associated with a user ID and a timestamp. I need to identify the most popular (i.e. most frequently occurring) three-page sequence overall. The log files are too large to be held in main memory at once.

Sample log file:

User ID  Page ID
A          1
A          2
A          3
B          2
B          3
C          1
B          4
A          4

Corresponding results:

A: 1-2-3, 2-3-4
B: 2-3-4
2-3-4 is the most popular three-page sequence

My idea is to use two hash tables. The first is keyed on user ID and stores that user's page sequence; the second is keyed on three-page sequences and stores the number of times each one appears. This takes O(n) space and O(n) time.
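For concreteness, here is a minimal in-memory sketch of this idea (Python; the function name is mine). Note that only each user's last two pages are needed to form the next triple, so the first table never has to hold full sequences:

    from collections import Counter, defaultdict

    def most_common_triple(visits):
        """visits: iterable of (user_id, page_id) pairs in timestamp order."""
        last_two = defaultdict(list)   # user_id -> up to two most recent pages
        counts = Counter()             # (p1, p2, p3) -> number of occurrences

        for user, page in visits:
            recent = last_two[user]
            if len(recent) == 2:
                counts[(recent[0], recent[1], page)] += 1
                recent.pop(0)
            recent.append(page)

        return counts.most_common(1)[0] if counts else None

    # The sample log from the question:
    log = [("A", 1), ("A", 2), ("A", 3), ("B", 2), ("B", 3),
           ("C", 1), ("B", 4), ("A", 4)]
    print(most_common_triple(log))    # ((2, 3, 4), 2)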

However, since I am using two hash tables, memory cannot hold everything at once and I have to spill to disk, and accessing disk that frequently is inefficient.

How can I do this better?

asked Dec 30 '11 by user1002288

1 Answer

If you want a quick approximate result, use hash tables as you intended, but bound each hash table with a limited-size queue that drops the least recently used entries.
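A minimal sketch of one way to bound the triple-count table (Python; the class name LRUCounter and its methods are my own, not from the answer):

    from collections import OrderedDict

    class LRUCounter:
        """Counting table bounded to `capacity` keys; when full, the
        least recently touched key is dropped along with its count."""

        def __init__(self, capacity):
            self.capacity = capacity
            self.counts = OrderedDict()

        def increment(self, key):
            # Pop and re-insert so the key moves to the most-recent end.
            self.counts[key] = self.counts.pop(key, 0) + 1
            if len(self.counts) > self.capacity:
                self.counts.popitem(last=False)    # evict least recently used

        def most_popular(self):
            return max(self.counts.items(), key=lambda kv: kv[1], default=None)

The user → last-two-pages table can be bounded the same way with a capped OrderedDict. Counts are approximate because an evicted triple loses its tally if it reappears later.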

If you want an exact result, use an external sort to order the log by user ID (stably, so each user's entries stay in timestamp order), then combine every three consecutive entries per user into a triple, and sort the triples; a final linear scan finds the most frequent one.
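A sketch of the middle step, assuming the log has already been externally sorted by user ID (e.g. with GNU sort -s -k1,1, whose merge sort works within a fixed memory budget); the function name and file layout are mine:

    import itertools

    def emit_triples(sorted_log_path, triples_path):
        """Stream a log that was already externally sorted by user ID
        (stably, so timestamp order survives within each user) and
        write one 'p1 p2 p3' line per consecutive three-page triple."""
        with open(sorted_log_path) as src, open(triples_path, "w") as dst:
            rows = (line.split() for line in src)          # (user_id, page_id)
            for _, group in itertools.groupby(rows, key=lambda r: r[0]):
                # Buffering one user's pages is fine unless a single
                # user's history is itself enormous.
                pages = [page for _, page in group]
                for i in range(len(pages) - 2):
                    dst.write(f"{pages[i]} {pages[i+1]} {pages[i+2]}\n")

The triples file can then itself be sorted externally and counted in one pass (e.g. sort triples.txt | uniq -c | sort -rn | head -1), so no step ever needs more memory than the external sorter's budget.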

Update (sort by timestamp)

Some preprocessing may be needed to make proper use of the log files' timestamps:

  • If the log files are already sorted by timestamp, no preprocessing is needed.
  • If there are several log files (possibly coming from independent processes), and each file is already sorted by timestamp, open all of them and merge-sort while reading; see the sketch after this list.
  • If the files are almost sorted by timestamp (as when several independent processes append to a single file), use a binary heap as a small reordering buffer to emit entries in the correct order.
  • If the files are not sorted by timestamp at all (unlikely in practice), use an external sort by timestamp.
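A sketch of the merge case, assuming each file's lines begin with a sortable timestamp field (the function name is mine):

    import heapq

    def merged_lines(paths):
        """Lazily merge several log files, each already sorted by
        timestamp, into one timestamp-ordered stream. Assumes lines
        sort naturally by their leading timestamp; otherwise pass a
        key= function to heapq.merge."""
        files = [open(p) for p in paths]
        try:
            # n-way merge: reads one line per file at a time, so
            # memory use is O(number of files), not O(total size).
            yield from heapq.merge(*files)
        finally:
            for f in files:
                f.close()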

Update 2 (improving the approximate method)

The approximate method with an LRU queue should produce quite good results for randomly distributed data. But webpage visits may follow different patterns at different times of day, or may differ on weekends, and the original approach can give poor results for such data. To improve it, a hierarchical LRU queue may be used.

Partition the LRU queue into log(N) smaller queues, with sizes N/2, N/4, ... The largest one accepts any element; the next accepts only elements seen at least 2 times; the next, at least 4 times, and so on. When an element is evicted from a sub-queue, it is re-inserted into the next one down, so it passes through every sub-queue lower in the hierarchy before it is removed completely. Such a priority queue still has O(1) complexity per operation, but gives a much better approximation for the most popular pages.
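One possible reading of this scheme as a sketch (Python; the class name and internals are mine, and I use a linear scan over levels where a real implementation would keep a key → level index to stay O(1)):

    from collections import OrderedDict

    class HierarchicalLRU:
        """Level k has capacity N / 2**(k+1) and admits keys seen at
        least 2**k times; a key evicted from level k is demoted to
        level k-1, so it falls through every lower level before it
        is finally dropped from level 0."""

        def __init__(self, total_capacity):
            self.caps = []
            cap = total_capacity // 2
            while cap >= 1:
                self.caps.append(cap)
                cap //= 2
            # Each level maps key -> count, in LRU order (oldest first).
            self.levels = [OrderedDict() for _ in self.caps]

        def increment(self, key):
            for lvl in range(len(self.levels) - 1, -1, -1):
                if key in self.levels[lvl]:
                    count = self.levels[lvl].pop(key) + 1
                    break
            else:
                lvl, count = 0, 1
            # Promote while the count qualifies for the next level up.
            while lvl + 1 < len(self.levels) and count >= 2 ** (lvl + 1):
                lvl += 1
            self._insert(key, count, lvl)

        def _insert(self, key, count, lvl):
            self.levels[lvl][key] = count                      # most recent end
            if len(self.levels[lvl]) > self.caps[lvl]:
                old_key, old_count = self.levels[lvl].popitem(last=False)
                if lvl > 0:
                    self._insert(old_key, old_count, lvl - 1)  # demote one level

        def most_popular(self):
            return max((kv for level in self.levels for kv in level.items()),
                       key=lambda kv: kv[1], default=None)

Demotion instead of outright eviction is what makes frequently seen keys survive bursts of one-off traffic: a popular triple has to be squeezed out of every level before its count is lost.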

answered by Evgeny Kluev