lets say i have a set of users, a set of songs, and a set of votes on each song:
=========== =========== =======
User Song Vote
=========== =========== =======
user1 song1 [score]
user1 song2 [score]
user1 song3 [score]
user2 song1 [score]
user2 song2 [score]
user2 song3 [score]
user3 song1 [score]
user3 song2 [score]
user3 song3 [score]
user-n song-n [score]
=========== =========== =======
whats the most efficient way to calculate user similarity based on song-votes? is there a better way than iterating over every user and every vote for every song?
There are two common metrics that can be used to find similarities between users:
Euclidean Distance, that is exactly what you are thinking: imagine a n-dimensional graph that has for each axis a song that is reviewed by two involved users (u1 and *u2) and the value on its axis is the score. You can easily calculate similarity using the formula:
for every song reviewed by u1 and u2, calculate pow(u1.song.score - u2.song.score, 2)
and add all together into sum_of_powers
. Similarity coefficient is then given by 1 / 1 + (sqrt(sum_of_powers))
.
Pearson Correlation (or correlation coefficient): it's a better approach that finds how much two data sets are related one with another. This approach uses more complex formulas and a little of statistics background, check it here: wiki. You will have a graph for every couple of users, then you plot points according to scores.. for example if aSong
has been voted 2
from u1 and 4
from u2 it will plot the point (2,4)
(assuming that user1 is x-axis and u2 is y-axis).
Just to clarify, you use linear regression to find two coefficients A
and B
, that describe the line that minimizes the distance from all the points of the graph. This line has this formula:y = Ax + B
. If two sets are similar points should be near to the main diagonal so A
should tend to 1 while B
to 0. Don't assume this explaination as complete or as a reference because it lacks soundness and typical math formalism, it just to give you an idea.
EDIT: like written by others, more complex algorithms to cluster data exist, like k-means but I suggest you to start from easy ones (actually you should need something more difficult just when you realize that results are not enough).
I recommend the book Programming Collective Intelligence from Toby Segaran. Chapter 3 describes different clustering methods like Hierarchical Clustering and K-means Clustering.
The source code for the examples is available here
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With