I'm in the process of designing a website built around the concept of recommending various items to users based on their tastes (e.g., items they've rated, items added to their favorites list, etc.). Some examples of this are Amazon, MovieLens, and Netflix.
Now, my problem is that I'm not sure where to start with the mathematical part of this system. I'm willing to learn whatever math is required; I just don't know what kind of math that is.
I've looked at a few of the publications over at Grouplens.org, specifically "Towards a Scalable kNN CF Algorithm: Exploring Effective Applications of Clustering" (pdf). I'm fine understanding everything up until page 5, "Prediction Generation".
P.S. I'm not exactly looking for an explanation of what's going on (though that would be helpful); I'm more interested in knowing what math I need so that I can understand it myself.
User-based collaborative filtering is a technique for predicting which items a user might like, based on the ratings given to those items by other users whose tastes are similar to the target user's. Many websites use collaborative filtering to build their recommendation systems.
This allows for serendipitous recommendations: collaborative filtering models can recommend an item to user A based on the interests of a similar user B, and these relationships can be learned automatically from the ratings themselves, without relying on hand-engineered features.
To build a system that can automatically recommend items to users based on the preferences of other users, the first step is to find similar users or items; the second step is to predict the ratings of the items the user has not yet rated.
Recommender systems based on collaborative filtering, which recommend items by pooling the behavior of many users, are the most widely used and proven method of providing recommendations. There are two types: user-to-user collaborative filtering, based on user-to-user similarity, and item-to-item collaborative filtering, based on item-to-item similarity.
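To make the first step (finding similar users) concrete, here is a minimal sketch on a made-up toy rating matrix, using cosine similarity; none of this data or naming comes from the paper, it's purely illustrative:

```python
# Minimal sketch of the "find similar users" step on a toy rating matrix.
# Rows are users, columns are items, 0 means "not rated". All data and
# names here are illustrative, not from the paper.
import numpy as np

ratings = np.array([
    [5, 3, 0, 1],   # user 0 (our target)
    [4, 0, 0, 1],   # user 1
    [1, 1, 0, 5],   # user 2
    [1, 0, 0, 4],   # user 3
], dtype=float)

def cosine_similarity(a, b):
    # Cosine of the angle between two rating vectors.
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

target = ratings[0]
sims = {u: cosine_similarity(target, ratings[u]) for u in range(1, 4)}
print(sims)  # user 1 comes out as the most similar to user 0
```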
Let me explain the procedure that the authors introduced (as I understood it):
Input: the training users' ratings of the items, together with the target user's ratings and a target item the target user has not rated yet.
Output: a predicted rating of the target user for the target item.
This can be repeated for a bunch of items, and then we return the top-N items (those with the highest predicted ratings).
Procedure:
The algorithm is very similar to the naive kNN method: search all the training data to find the users whose ratings are most similar to the target user's, then combine their ratings to produce a prediction (a form of weighted voting).
This simple method does not scale well as the number of users/items increases.
The proposed algorithm instead first clusters the training users into K groups (groups of people who rated items similarly), where K << N (N being the total number of users).
Then we scan those clusters to find which ones the target user is closest to (instead of comparing against all the training users).
Finally, we pick l of those clusters and make our prediction as an average weighted by the similarity to those l clusters.
Note that in the paper the similarity measure used is the correlation coefficient, and the clustering algorithm is bisecting K-Means. We can simply use standard k-means instead, and we can use other similarity metrics as well, such as Euclidean distance or cosine similarity (see the sketch below).
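As a rough sketch of the steps above, here is what the clustering part could look like, with scikit-learn's standard KMeans standing in for the paper's bisecting K-Means (as noted, that substitution is fine). The data and the parameter values (K, l) are made up for illustration:

```python
# Sketch of the clustered kNN idea: cluster the training users, then
# compare the target user only against the K centroids instead of all
# N users. KMeans here stands in for the paper's bisecting K-Means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(1000, 50)).astype(float)  # fake data: N=1000 users, 50 items

K = 20                                   # number of clusters, K << N
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(ratings)
centroids = km.cluster_centers_          # one "average user" per cluster

target = ratings[0]                      # pretend this is the target user
l = 3                                    # how many nearest clusters to keep
dists = np.linalg.norm(centroids - target, axis=1)
closest_l = np.argsort(dists)[:l]        # indices of the l nearest clusters
```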
The first formula on page 5 is the definition of the correlation:
corr(x,y) = cov(x,y) / (std(x) * std(y))
          = mean( (x - mean(x)) * (y - mean(y)) ) / (std(x) * std(y))

i.e. the covariance of the two users' rating vectors divided by the product of their standard deviations.
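In code, with NumPy (the hand-rolled version is just to spell the formula out; np.corrcoef computes the same quantity, and the two made-up rating vectors are only for illustration):

```python
# Pearson correlation between two users' rating vectors, spelled out
# exactly as in the formula above (population std throughout).
import numpy as np

def pearson(x, y):
    xc = x - x.mean()                    # x - mean(x)
    yc = y - y.mean()                    # y - mean(y)
    return (xc * yc).mean() / (x.std() * y.std())

x = np.array([5.0, 3.0, 2.0, 1.0])       # made-up ratings by two users
y = np.array([4.0, 3.0, 1.0, 1.0])       # over the same four items
print(pearson(x, y))                     # ~0.943
print(np.corrcoef(x, y)[0, 1])           # NumPy's built-in gives the same
```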
The second formula is basically a weighted average:
predRating = sum_i( rating_i * corr(target, user_i) ) / sum_i( corr(target, user_i) )

where i loops over the selected top-l clusters, and rating_i is cluster i's rating of the item in question.
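As a minimal sketch, assuming we already have each selected cluster's rating of the target item and its correlation with the target user (the numbers below are made up):

```python
# Weighted-average prediction: ratings from clusters that correlate more
# strongly with the target user get more weight. Values are illustrative.
import numpy as np

cluster_ratings = np.array([4.0, 3.5, 2.0])   # rating_i for the l = 3 clusters
corrs = np.array([0.9, 0.6, 0.2])             # corr(target, cluster_i)

pred_rating = (cluster_ratings * corrs).sum() / corrs.sum()
print(pred_rating)  # ~3.59
```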
Hope this clarifies things a little bit :)