Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most Frequent 3 page sequence in a weblog

Tags:

algorithm

Given a web log which consists of fields 'User ' 'Page url'. We have to find out the most frequent 3-page sequence that users takes.

There is a time stamp. and it is not guaranteed that the single user access will be logged sequentially it could be like user1 Page1 user2 Pagex user1 Page2 User10 Pagex user1 Page 3 her User1s page sequence is page1-> page2-> page 3

like image 682
Sundararajan S Avatar asked Jun 07 '10 16:06

Sundararajan S


4 Answers

Assuming your log is stored in timestamp order, here's an algorithm to do what you need:

  1. Create a hashtable 'user_visits' mapping user ID to the last two pages you observed them to visit
  2. Create a hashtable 'visit_count' mapping 3-tuples of pages to frequency counts
  3. For each entry (user, URL) in the log:
    1. If 'user' exists in user_visits with two entries, increment the entry in visit_count corresponding to the 3-tuple of URLs by one
    2. Append 'URL' to the relevant entry in user_visits, removing the oldest entry if necessary.
  4. Sort the visit_count hashtable by value. This is your list of most popular sequences of URLs.

Here's an implementation in Python, assuming your fields are space-separated:

fh = open('log.txt', 'r')
user_visits = {}
visit_counts = {}
for row in fh:
  user, url = row.split(' ')
  prev_visits = user_visits.get(user, ())
  if len(prev_vists) == 2:
    visit_tuple = prev_vists + (url,)
    visit_counts[visit_tuple] = visit_counts.get(visit_tuple, 0) + 1
  user_visits[user] = (prev_vists[1], url)
popular_sequences = sorted(visit_counts, key=lambda x:x[1], reverse=True)
like image 50
Nick Johnson Avatar answered Nov 18 '22 14:11

Nick Johnson


Quick and dirty:

  • Build a list of url/timestamps per user
  • sort each list by timestamp
  • iterate over each list
    • for each 3 URL sequence, create or increment a counter
  • find the highest count in the URL sequence count list

foreach(entry in parsedLog)
{
    users[entry.user].urls.add(entry.time, entry.url)
}

foreach(user in users)
{
    user.urls.sort()
    for(i = 0; i < user.urls.length - 2; i++)
    {
        key = createKey(user.urls[i], user.urls[i+1], user.urls[i+2]
        sequenceCounts.incrementOrCreate(key);
    }
}

sequenceCounts.sortDesc()
largestCountKey = sequenceCounts[0]
topUrlSequence = parseKey(largestCountkey)
like image 30
jball Avatar answered Nov 18 '22 14:11

jball


Here's a bit of SQL assuming you could get your log into a table such as

CREATE TABLE log (
   ord  int,
   user VARCHAR(50) NOT NULL,
   url  VARCHAR(255) NOT NULL,
   ts   datetime
) ENGINE=InnoDB;

If the data is not sorted per user then (assuming that ord column is the number of the line from the log file)

SELECT t.url, t2.url, t3.url, count(*) c
FROM  
      log t INNER JOIN
      log t2 ON t.user = t2.user INNER JOIN
      log t3 ON t2.user = t3.user
WHERE 
      t2.ord IN (SELECT MIN(ord) 
                 FROM log i 
                 WHERE i.user = t.user AND i.ord > t.ord) 
      AND
      t3.ord IN (SELECT MIN(ord) 
                 FROM log i 
                 WHERE i.user = t.user AND i.ord > t2.ord)
GROUP BY t.user, t.url, t2.url, t3.url
ORDER BY c DESC
LIMIT 10;

This will give top ten 3 stop paths for a user. Alternatively if you can get it ordered by user and time you can join on rownumbers more easily.

like image 26
Unreason Avatar answered Nov 18 '22 16:11

Unreason


Source code in Mathematica

s= { {user},{page} }  (* load List (log) here *)

sortedListbyUser=s[[Ordering[Transpose[{s[[All, 1]], Range[Length[s]]}]] ]]

Tally[Partition [sortedListbyUser,3,1]]
like image 1
Dr. belisarius Avatar answered Nov 18 '22 15:11

Dr. belisarius