I'm trying to cluster and classify users with Mahout. At the moment I'm in the planning phase, my head is full of ideas, and since I'm relatively new to the area I'm stuck on data formatting.
Let's say we have two data tables (big enough). In the first table there are users and their actions. Every user has at least one action, and some have very many. There are about 10,000 distinct user_actions and millions of records in the table.
user - user_action
u1 - a
u2 - b
u3 - a
u1 - c
u2 - c
u2 - c
u1 - b
u4 - f
u4 - e
u1 - e
u1 - d
u5 - d
In the other table there are action categories. Each action may have zero, one, or multiple categories. There are 60 categories in total.
user_action - category
a - cat1
b - cat2
c - cat1
d - NULL
e - cat1, cat3
f - cat4
I'm going to try to build a user classification model with Mahout, but I have no idea what I should do. What kind of user vectors should I create? Do I even need user vectors?
I think I need to create something like:
u1 (a, c, b, e, d)
u2 (b, c, c)
u3 (a)
u4 (f, e)
u5 ()
The problem here is that some users have performed more than 100,000 actions (many of them repeats of the same action).
So I think this is more useful:
u1 (cat1, cat1, cat2, cat1, cat3)
u2 (cat2, cat1, cat1)
u3 (cat1)
u4 (cat4, cat1, cat3)
u5 ()
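To make the second representation concrete, here is a plain-Python sketch (independent of Mahout) that expands each user's action history into a bag of categories using the two tables from the question. The dictionaries below are just the question's toy example; actions with no category (like `d`) contribute nothing to the bag.

```python
# Toy data from the question: users -> actions, actions -> categories.
user_actions = {
    "u1": ["a", "c", "b", "e", "d"],
    "u2": ["b", "c", "c"],
    "u3": ["a"],
    "u4": ["f", "e"],
    "u5": [],
}

action_categories = {
    "a": ["cat1"],
    "b": ["cat2"],
    "c": ["cat1"],
    "d": [],               # NULL in the table: no categories
    "e": ["cat1", "cat3"],
    "f": ["cat4"],
}

def category_bag(actions):
    """Flatten a list of actions into the multiset of their categories."""
    bag = []
    for action in actions:
        bag.extend(action_categories.get(action, []))
    return bag

user_bags = {user: category_bag(acts) for user, acts in user_actions.items()}
print(user_bags["u1"])  # ['cat1', 'cat1', 'cat2', 'cat1', 'cat3']
```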
These are the things I also worry about. Any guidance is welcome.
I would create one row per user, as you are doing, with one column for each category; that gives 60 columns if I understand your example correctly. Each column would hold the number of times the corresponding category was seen for that user. The result is 60 numbers per user, most of them 0.
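A minimal sketch of this step, turning each user's category bag into a fixed-length count vector (the category list and bags below come from the question's small example, so there are 4 columns here instead of 60):

```python
from collections import Counter

categories = ["cat1", "cat2", "cat3", "cat4"]  # would be 60 in practice

user_bags = {
    "u1": ["cat1", "cat1", "cat2", "cat1", "cat3"],
    "u2": ["cat2", "cat1", "cat1"],
    "u3": ["cat1"],
    "u4": ["cat4", "cat1", "cat3"],
    "u5": [],
}

def count_vector(bag):
    """One count per category, in a fixed column order."""
    counts = Counter(bag)
    return [counts.get(cat, 0) for cat in categories]

user_vectors = {user: count_vector(bag) for user, bag in user_bags.items()}
print(user_vectors["u1"])  # [3, 1, 1, 0]
```

For Mahout itself these rows would then be written out as vectors (e.g. sparse vectors in a SequenceFile), but the counting logic is the same.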
It might be necessary to perform some sort of normalisation on the rows. By analogy with what is done to produce document vectors in text mining, something like term frequency normalisation could be applied to the row. Each column might also require normalising.
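Row-wise term-frequency normalisation could look like the sketch below: divide each count by the row total so that very active users and light users become comparable. Column-wise scaling (e.g. dividing by each column's maximum) would follow the same pattern across rows instead.

```python
def tf_normalise(row):
    """Divide each count by the row total (term-frequency style)."""
    total = sum(row)
    if total == 0:
        return [0.0] * len(row)   # users with no actions stay all-zero
    return [count / total for count in row]

print(tf_normalise([3, 1, 1, 0]))  # [0.6, 0.2, 0.2, 0.0]
```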
From here, clustering could be performed using your algorithm of choice with clustering validity measures to help guide your choice of the most interesting clusterings.
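As an illustration of the clustering-plus-validity idea, here is a pure-Python k-means (Lloyd's algorithm, Python 3.8+ for `math.dist`) with the within-cluster sum of squares as a crude validity measure; comparing that number across different k values or representations helps pick between runs. In practice you would use Mahout's k-means job rather than this sketch, and the fixed initial centroids are an assumption made only so the run is deterministic.

```python
import math

def assign(points, centroids):
    """Group each point with its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        clusters[distances.index(min(distances))].append(p)
    return clusters

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            [sum(dim) / len(cl) for dim in zip(*cl)] if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    # Within-cluster sum of squares: lower means tighter clusters.
    wcss = sum(
        math.dist(p, c) ** 2
        for cl, c in zip(assign(points, centroids), centroids)
        for p in cl
    )
    return centroids, wcss

# The normalised example vectors from above (u1..u5).
points = [
    [0.6, 0.2, 0.2, 0.0],
    [0.667, 0.333, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.333, 0.0, 0.333, 0.333],
    [0.0, 0.0, 0.0, 0.0],
]
centroids, wcss = kmeans(points, [list(points[0]), list(points[4])])
print(wcss)
```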
It is the nature of this kind of work that you will have to repeat the process iteratively, perhaps representing the input data in new ways.