Which algorithm/implementation for weighted similarity between users by their selected, distanced attributes?

Data Structure:

User has many Profiles
    (Limit - no more than one of each profile type per user, no duplicates)
Profiles has many Attribute Values
    (A user can have as many or as few attribute values as they like)
Attributes belong to a category
    (No overlap. This controls which attribute values a profile can have)
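
A rough sketch of the shape of the data (illustrative names and values only, not my actual schema):

# Illustrative only: names and values are made up.
user = {
    "video": {                              # profile type (at most one per user)
        "genre": ["Comedy", "Action"],      # attribute in the Video category -> its values
        "era":   ["80s"],
    },
    "music": {
        "genre": ["Jazz"],
    },
}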

Example/Context:

I believe that on Stack Exchange one user can have many profiles, since they differ per exchange site. In this problem:

  • Profile: e.g. Video, so a Video profile only contains Attributes of the Video category
  • Attribute: e.g. an Attribute in the Video category may be Genre
  • Attribute Value: e.g. Comedy, Action and Thriller are all Attribute Values of Genre

Profiles and Attributes are just ways of grouping Attribute Values on two levels. Without grouping (which is needed for weighting in 2. onwards), the relationship is just User hasMany Attribute Values.

Problem:

Give each user a similarity rating against each other user.

  1. Similarity based on All Attribute Values associated with the user.
    • Flat/one level
    • Unequal number of attribute values between two users
    • Attribute value can only be selected once per user, so no duplicates
    • Therefore, binary string/boolean array with Cosine Similarity?
  2. 1 + Weight Profiles
    • Give each profile a weight (totaling 1?)
    • Work out each profile's similarity, multiply by its weight, and sum? (rough sketch of what I mean after this list)
  3. 1 + Weight Attribute Categories and Profiles
    • As an attribute belongs to a category, categories can be weighted
    • Similarity per category, weighted sum, then same by profile?
    • Or merge profile and category weights
  4. 3 + Distance between every attribute value
    • Table of similarity distance for every possible value vs value
    • Rather than similarity by value === value
    • 'Close' attributes contribute to overall similarity.
    • No idea how to do this one
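
To show what I mean by 2./3., here is a very rough sketch (Python used as pseudocode; the weights and the data shape are placeholders, not a real design):

# Rough sketch of 2./3.: per-profile similarity, then a weighted sum.
# Assumes each user is reduced to {profile_type: set of attribute values}.
PROFILE_WEIGHTS = {"video": 0.6, "music": 0.4}    # placeholder weights, totalling 1

def profile_similarity(values_a, values_b):
    # cosine similarity on binary vectors = |intersection| / (sqrt(|A|) * sqrt(|B|))
    a, b = set(values_a), set(values_b)
    if not a or not b:
        return 0.0
    return len(a & b) / ((len(a) ** 0.5) * (len(b) ** 0.5))

def user_similarity(user_a, user_b):
    return sum(weight * profile_similarity(user_a.get(p, ()), user_b.get(p, ()))
               for p, weight in PROFILE_WEIGHTS.items())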

Fancy code and useful functions are great, but I'm really looking to fully understand how to achieve these tasks, so I think generic pseudocode is best.

Thanks!

asked Feb 11 '14 by StringsOnFire


1 Answer

First of all, you should remember that everything should be made as simple as possible, but not simpler. This rule applies to many areas, but in things like semantics, similarity and machine learning it is essential. Using several layers of abstraction (attributes -> categories -> profiles -> users) makes your model harder to understand and to reason about, so I would flatten them away as much as possible. This means it's highly preferable to keep a direct relation between users and attributes. So, basically, your users should be represented as vectors, where each variable (vector element) represents a single attribute.

If you choose such a representation, make sure all attributes make sense and have an appropriate type in this context. For example, you can represent 5 video genres as 5 distinct variables, but not as numbers from 1 to 5, since cosine similarity (and most other algorithms) will treat them incorrectly (e.g. multiplying thriller, represented as 2, by comedy, represented as 5, which actually makes no sense).
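
A quick sketch of that encoding (Python; the genre list is just an example):

# One 0/1 variable per genre, instead of a single 1..5 "genre" number.
GENRES = ["comedy", "action", "thriller", "drama", "horror"]

def encode(user_genres):
    # the fixed order of GENRES defines the vector layout
    return [1.0 if g in user_genres else 0.0 for g in GENRES]

encode({"comedy", "thriller"})   # -> [1.0, 0.0, 1.0, 0.0, 0.0]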

It's ok to use distance between attributes when applicable, though I can hardly come up with an example in your setting.

At this point you should stop reading and try it out: a simple representation of users as vectors of attributes, plus cosine similarity. If it works well, leave it as is; overcomplicating a model is never good.
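
A minimal sketch of that baseline (plain Python, assuming each user has already been reduced to a set of attribute values):

# Baseline sketch: one 0/1 variable per known attribute value, plus cosine similarity.
from math import sqrt

def user_vector(user_values, all_values):
    # all_values: a fixed, ordered list of every attribute value in the system
    return [1.0 if v in user_values else 0.0 for v in all_values]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx, ny = sqrt(sum(a * a for a in x)), sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

all_values = ["comedy", "action", "thriller", "jazz", "rock"]    # illustrative
sim = cosine(user_vector({"comedy", "jazz"}, all_values),
             user_vector({"comedy", "rock"}, all_values))        # -> 0.5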

And if the model performs badly, try to understand why. Do you have enough relevant attributes? Are there too many noisy variables that only make things worse? Should some attributes really carry more weight than others? Depending on the answers, you may want to:

  1. Run feature selection to avoid noisy variables.
  2. Transform your variables, representing them in some other "coordinate system". For example, instead of using N variables for N video genres, you may use M other variables to represent closeness to specific social groups. Say, a 1 for the "comedy" variable becomes 0.8 for the "children" variable, 0.6 for "housewife" and 0.9 for "old_people". Or anything else: any translation that seems more "correct" is fine. (A rough sketch of such a mapping follows this list.)
  3. Use weights. Not weights for categories or profiles, but weights for distinct attributes. Don't set these weights yourself, though; instead, run a linear regression to find them.
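
To illustrate point 2, a rough sketch of such a translation (the groups and numbers are invented, as in the example above):

# Sketch of point 2: translate genre variables into "social group" closeness variables.
GROUPS = ["children", "housewife", "old_people"]
GENRE_TO_GROUPS = {
    "comedy":   [0.8, 0.6, 0.9],    # made-up closeness values
    "thriller": [0.3, 0.4, 0.2],
}

def to_group_vector(user_genres):
    vec = [0.0] * len(GROUPS)
    for genre in user_genres:
        for i, closeness in enumerate(GENRE_TO_GROUPS.get(genre, [0.0] * len(GROUPS))):
            vec[i] += closeness     # accumulate contributions from each selected genre
    return vec

to_group_vector({"comedy"})   # -> [0.8, 0.6, 0.9]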

Let me describe the last point in a bit more detail. Instead of simple cosine similarity, which (for vectors normalized to unit length, so the norm terms can be dropped) looks like this:

cos(x, y) = x[0]*y[0] + x[1]*y[1] + ... + x[n]*y[n]

you may use weighted version:

cos(x, y) = w[0]*x[0]*y[0] + w[1]*x[1]*y[1] + ... + w[n]*x[n]*y[n]

The standard way to find such weights is to use some kind of regression (linear regression is the most popular). Normally, you collect a dataset (X, y) where X is a matrix with your data vectors as rows (e.g. details of a house being sold) and y is some kind of "correct answer" (e.g. the actual price the house was sold for). However, in your case there's no correct answer for a single user vector; you can only define a correct answer for the similarity between two users. So why not use that? Just make each row of X a combination of 2 user vectors, and the corresponding element of y the similarity between them (which you assign yourself for the training dataset). E.g.:

X[k] = [ user_i[0]*user_j[0], user_i[1]*user_j[1], ..., user_i[n]*user_j[n] ]
y[k] = .75  // or whatever you assign to it
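
A sketch of that procedure using NumPy's least-squares solver (the labelled pairs and their target similarities are placeholders you would provide yourself):

# Sketch: learn per-attribute weights w from hand-labelled pairs of user vectors.
import numpy as np

def pair_features(u, v):
    # element-wise products of the two users' attribute vectors (one row of X)
    return np.asarray(u, dtype=float) * np.asarray(v, dtype=float)

# (user_vector_i, user_vector_j, similarity you assigned) - placeholder examples
labelled_pairs = [
    ([1, 0, 1, 0], [1, 0, 0, 1], 0.40),
    ([1, 1, 0, 0], [1, 1, 0, 0], 1.00),
    ([0, 1, 1, 0], [1, 0, 1, 1], 0.55),
]

X = np.array([pair_features(u, v) for u, v, _ in labelled_pairs])
y = np.array([s for _, _, s in labelled_pairs])

w, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares fit of the weights

def weighted_similarity(u, v):
    return float(np.dot(w, pair_features(u, v)))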

HTH

answered Nov 03 '22 by ffriend