
Data Mining situation

Suppose I have the data as mentioned below.

11AM user1 Brush

11:05AM user1 Prep Breakfast

11:10AM user1 eat Breakfast

11:15AM user1 Take bath

11:30AM user1 Leave for office

12PM user2 Brush

12:05PM user2 Prep Breakfast

12:10PM user2 eat Breakfast

12:15PM user2 Take bath

12:30PM user2 Leave for office

11AM user3 Take bath

11:05AM user3 Prep Breakfast

11:10AM user3 Brush

11:15AM user3 eat Breakfast

11:30AM user3 Leave for office

12PM user4 Take bath

12:05PM user4 Prep Breakfast

12:10PM user4 Brush

12:15PM user4 eat Breakfast

12:30PM user4 Leave for office

This data describes the daily routines of different people. user1 and user2 appear to behave similarly: they perform the activities at different times, but in the same sequence. For the same reason, user3 and user4 behave similarly. I need to put such users into groups. In this example, group 1 contains user1 and user2, and group 2 contains user3 and user4.

How should I approach this kind of situation? I am trying to learn data mining, and this is an example I came up with as a data mining problem. I believe the data has a pattern in it, but I cannot think of an approach that would reveal it. I also have to apply the approach to my own dataset, which is much larger but similar to this one :) The data consists of logs recording which events occurred at what time, and I want to find groups that represent similar sequences of events.

Any pointers would be appreciated.

user722856 asked Sep 30 '11 17:09


1 Answer

This looks like clustering on top of association mining, more precisely the Apriori algorithm. Something like this:

  1. Mine all associations between actions, i.e. sequences such as Brush -> Prep Breakfast, Prep Breakfast -> Eat Breakfast, ..., Brush -> Prep Breakfast -> Eat Breakfast, etc.: every pair, triplet, quadruple, and so on that you can find in your data.
  2. Make a separate attribute from each such sequence. For better performance, add a boost of 2 for pair attributes, 3 for triplets, and so on.
  3. At this point you have an attribute vector with a corresponding boost vector. Calculate a feature vector for each user: set 1 * boost at each position if that sequence occurs in the user's actions, and 0 otherwise. This gives a vector representation of each user.
  4. Run whichever clustering algorithm best fits your needs on these vectors. Each cluster found is one of the groups you are looking for.
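Steps 1 to 3 can be sketched in a few lines of Python. This is only an illustration under some assumptions: the per-user action lists are assumed to be already sorted by timestamp, only contiguous runs of actions are counted, and every run is enumerated exhaustively instead of being pruned by a support threshold as a real Apriori implementation would do. The names `contiguous_runs` and `feature_vector` are made up for this sketch.

```python
# Per-user action sequences from the question, assumed already ordered by time.
routine_a = ["Brush", "Prep Breakfast", "Eat Breakfast", "Take bath", "Leave for office"]
routine_b = ["Take bath", "Prep Breakfast", "Brush", "Eat Breakfast", "Leave for office"]
sequences = {
    "user1": routine_a,
    "user2": routine_a,
    "user3": routine_b,
    "user4": routine_b,
}


def contiguous_runs(actions, min_len=2):
    """Return every contiguous run of at least min_len actions as a tuple."""
    n = len(actions)
    return {
        tuple(actions[i:i + k])
        for k in range(min_len, n + 1)
        for i in range(n - k + 1)
    }


# Step 1: collect every pair, triplet, ... seen in any user's log.
user_runs = {user: contiguous_runs(actions) for user, actions in sequences.items()}

# Step 2: one attribute per distinct run, in a fixed order (short runs first).
attributes = sorted(set().union(*user_runs.values()), key=lambda run: (len(run), run))


# Step 3: 1 * boost if the user has the run, else 0; the boost is the run length.
def feature_vector(runs):
    return [len(attr) if attr in runs else 0 for attr in attributes]


vectors = {user: feature_vector(runs) for user, runs in user_runs.items()}
```

On a large log, enumerating every run becomes expensive, so you would probably cap the run length or keep only runs that occur for some minimum number of users.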

Example:

Let's mark all actions as letters:

a - Brush
b - Prep Breakfast
c - Eat Breakfast
d - Take Bath
...

Your attributes will look like

a1: a->b
a2: a->c
a3: a->d
...
a10: b->a
a11: b->c
a12: b->d
...
a30: a->b->c->d
a31: a->b->d->c
...

User feature vectors in this case will be:

attributes   = a1, a2, a3, a4, ..., a10, a11, a12, ..., a30, a31, ...
user1        =  2,  0,  0,  0, ...,   0,   2,   0, ...,   4,   0, ...
user2        =  2,  0,  0,  0, ...,   0,   2,   0, ...,   4,   0, ...
user3        =  0,  2,  0,  0, ...,   2,   0,   0, ...,   0,   0, ...

To compare two users, a distance measure is needed. The simplest is cosine similarity, which is just the cosine of the angle between the two feature vectors (cosine distance is 1 minus that value). If two users have exactly the same sequence of actions, their similarity will be 1. If they have nothing in common, their similarity will be 0.
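As a rough illustration, cosine similarity needs only the standard library. The three vectors below are hypothetical toy values over five attributes (a->b, b->c, b->a, a->c, a->b->c->d), with pairs boosted by 2 and the quadruple by 4; they are not computed from a real log.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A user with no matching sequences is treated as similar to nobody.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical feature vectors: a->b, b->c, b->a, a->c, a->b->c->d
user1 = [2, 2, 0, 0, 4]
user2 = [2, 2, 0, 0, 4]
user3 = [0, 0, 2, 2, 0]

same_routine = cosine_similarity(user1, user2)       # approximately 1
different_routine = cosine_similarity(user1, user3)  # 0.0, no shared sequences
```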

With a distance measure in place, use a clustering algorithm (say, k-means) to group the users.
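A minimal pure-Python sketch of that last step follows. Plain k-means works with Euclidean distance, so the vectors are first scaled to unit length; for unit vectors, the squared Euclidean distance equals 2 - 2 * cosine, so the two measures rank neighbours the same way. The feature vectors are hypothetical toy values, the farthest-first initialisation is just one deterministic choice made for this example, and in practice you would more likely reach for a library implementation.

```python
import math


def normalize(v):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, iterations=20):
    """Plain k-means with deterministic farthest-first initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from its nearest existing centroid.
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical boosted feature vectors, one per user.
users = ["user1", "user2", "user3", "user4"]
raw = [[2, 2, 0, 0, 4], [2, 2, 0, 0, 4], [0, 0, 2, 2, 0], [0, 0, 2, 2, 0]]

labels = kmeans([normalize(v) for v in raw], k=2)
groups = {}
for user, label in zip(users, labels):
    groups.setdefault(label, []).append(user)
```

Here `groups` ends up with user1 and user2 in one cluster and user3 and user4 in the other. On real data the number of clusters `k` is not known in advance and has to be chosen, for example by trying several values.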

ffriend answered Sep 29 '22 11:09