User has many Profiles
(Limit - no more than one of each profile type per user, no duplicates)
Profiles has many Attribute Values
(A user can have as many or few attribute values as they like)
Attributes belong to a category
(No overlap. This controls which attribute values a profile can have)
I believe with stack exchange you can have many profiles for one user, as they differ per exchange site? In this problem:
Profiles and Attributes are just ways of grouping Attribute Values on two levels. Without grouping (which is needed for weighting in 2. onwards), the relationship is just User hasMany Attribute Values.
Give each user a similarity rating against each other user.
Fancy code and useful functions are great, but I'm really looking to fully understand how to achieve these tasks, so I think generic pseudocode is best.
Thanks!
First of all, you should remember that everything should be made as simple as possible, but not simpler. This rule applies to many areas, but in things like semantics, similarity and machine learning it is essential. Using several layers of abstraction (attributes -> categories -> profiles -> users) makes your model harder to understand and to reason about, so I would try to omit it as much as possible. This means that it's highly preferable to keep direct relation between users and attributes. So, basically your users should be represented as vectors, where each variable (vector element) represents single attribute.
If you choose such representation, make sure all attributes make sense and have appropriate type in this context. For example, you can represent 5 video genres as 5 distinct variables, but not as numbers from 1 to 5, since cosine similarity (and most other algos) will treat them incorrectly (e.g. multiply thriller, represented as 2, with comedy, represented as 5, which makes no sense actually).
It's ok to use distance between attributes when applicable. Though I can hardly come up with example in your settings.
At this point you should stop reading and try it out: simple representation of users as vector of attributes and cosine similarity. If it works well, leave it as is - overcomplicating a model is never good.
And if the model performs bad, try to understand why. Do you have enough relevant attributes? Or are there too many noisy variables that only make it worse? Or do some attributes should really have larger importance than others? Depending on these questions, you may want to:
Let me describe last point in a bit more detail. Instead of simple cosine similarity, which looks like this:
cos(x, y) = x[0]*y[0] + x[1]*y[1] + ... + x[n]*y[n]
you may use weighted version:
cos(x, y) = w[0]*x[0]*y[0] + w[1]*x[1]*y[1] + ... + w[2]*x[2]*y[2]
Standard way to find such weights is to use some kind of regression (linear one is the most popular). Normally, you collect dataset (X, y)
where X
is a matrix with your data vectors on rows (e.g. details of house being sold) and y
is some kind of "correct answer" (e.g. actual price that the house was sold for). However, in you case there's no correct answer to user vectors. In fact, you can define correct answer to their similarity only. So why not? Just make each row of X
be a combination of 2 user vectors, and corresponding element of y
- similarity between them (you should assign it yourself for a training dataset). E.g.:
X[k] = [ user_i[0]*user_j[0], user_i[1]*user_j[1], ..., user_i[n]*user_j[n] ]
y[k] = .75 // or whatever you assign to it
HTH
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With