Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Similarity Between Users Based On Votes

lets say i have a set of users, a set of songs, and a set of votes on each song:

=========== =========== =======
User        Song        Vote
=========== =========== =======
user1       song1       [score]
user1       song2       [score]
user1       song3       [score]
user2       song1       [score]
user2       song2       [score]
user2       song3       [score]
user3       song1       [score]
user3       song2       [score]
user3       song3       [score]
user-n      song-n      [score]
=========== =========== =======

whats the most efficient way to calculate user similarity based on song-votes? is there a better way than iterating over every user and every vote for every song?

like image 766
Carson Avatar asked Dec 02 '09 22:12

Carson


2 Answers

There are two common metrics that can be used to find similarities between users:

  1. Euclidean Distance, that is exactly what you are thinking: imagine a n-dimensional graph that has for each axis a song that is reviewed by two involved users (u1 and *u2) and the value on its axis is the score. You can easily calculate similarity using the formula:

    for every song reviewed by u1 and u2, calculate pow(u1.song.score - u2.song.score, 2) and add all together into sum_of_powers. Similarity coefficient is then given by 1 / 1 + (sqrt(sum_of_powers)).

  2. Pearson Correlation (or correlation coefficient): it's a better approach that finds how much two data sets are related one with another. This approach uses more complex formulas and a little of statistics background, check it here: wiki. You will have a graph for every couple of users, then you plot points according to scores.. for example if aSong has been voted 2 from u1 and 4 from u2 it will plot the point (2,4) (assuming that user1 is x-axis and u2 is y-axis).

Just to clarify, you use linear regression to find two coefficients A and B, that describe the line that minimizes the distance from all the points of the graph. This line has this formula:y = Ax + B. If two sets are similar points should be near to the main diagonal so A should tend to 1 while B to 0. Don't assume this explaination as complete or as a reference because it lacks soundness and typical math formalism, it just to give you an idea.

EDIT: like written by others, more complex algorithms to cluster data exist, like k-means but I suggest you to start from easy ones (actually you should need something more difficult just when you realize that results are not enough).

like image 107
Jack Avatar answered Oct 09 '22 05:10

Jack


I recommend the book Programming Collective Intelligence from Toby Segaran. Chapter 3 describes different clustering methods like Hierarchical Clustering and K-means Clustering.

The source code for the examples is available here

like image 44
Peter Hoffmann Avatar answered Oct 09 '22 05:10

Peter Hoffmann