Given a web log which consists of fields 'User ' 'Page url'. We have to find out the most frequent 3-page sequence that users takes.
There is a time stamp. and it is not guaranteed that the single user access will be logged sequentially it could be like user1 Page1 user2 Pagex user1 Page2 User10 Pagex user1 Page 3 her User1s page sequence is page1-> page2-> page 3
Assuming your log is stored in timestamp order, here's an algorithm to do what you need:
Here's an implementation in Python, assuming your fields are space-separated:
fh = open('log.txt', 'r')
user_visits = {}
visit_counts = {}
for row in fh:
user, url = row.split(' ')
prev_visits = user_visits.get(user, ())
if len(prev_vists) == 2:
visit_tuple = prev_vists + (url,)
visit_counts[visit_tuple] = visit_counts.get(visit_tuple, 0) + 1
user_visits[user] = (prev_vists[1], url)
popular_sequences = sorted(visit_counts, key=lambda x:x[1], reverse=True)
Quick and dirty:
foreach(entry in parsedLog)
{
users[entry.user].urls.add(entry.time, entry.url)
}
foreach(user in users)
{
user.urls.sort()
for(i = 0; i < user.urls.length - 2; i++)
{
key = createKey(user.urls[i], user.urls[i+1], user.urls[i+2]
sequenceCounts.incrementOrCreate(key);
}
}
sequenceCounts.sortDesc()
largestCountKey = sequenceCounts[0]
topUrlSequence = parseKey(largestCountkey)
Here's a bit of SQL assuming you could get your log into a table such as
CREATE TABLE log (
ord int,
user VARCHAR(50) NOT NULL,
url VARCHAR(255) NOT NULL,
ts datetime
) ENGINE=InnoDB;
If the data is not sorted per user then (assuming that ord
column is the number of the line from the log file)
SELECT t.url, t2.url, t3.url, count(*) c
FROM
log t INNER JOIN
log t2 ON t.user = t2.user INNER JOIN
log t3 ON t2.user = t3.user
WHERE
t2.ord IN (SELECT MIN(ord)
FROM log i
WHERE i.user = t.user AND i.ord > t.ord)
AND
t3.ord IN (SELECT MIN(ord)
FROM log i
WHERE i.user = t.user AND i.ord > t2.ord)
GROUP BY t.user, t.url, t2.url, t3.url
ORDER BY c DESC
LIMIT 10;
This will give top ten 3 stop paths for a user. Alternatively if you can get it ordered by user and time you can join on rownumbers more easily.
Source code in Mathematica
s= { {user},{page} } (* load List (log) here *)
sortedListbyUser=s[[Ordering[Transpose[{s[[All, 1]], Range[Length[s]]}]] ]]
Tally[Partition [sortedListbyUser,3,1]]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With