 

Predict with SVD matrices

I'm participating in a programming contest where I have data in which the first column is a user ID, the second column is a movie ID, and the third is a rating on a ten-point scale.

0 0 9
0 1 8
1 1 4
1 2 6
2 2 7

And I have to predict the third column (user, movie, ?):

0 2
1 0
2 0
2 1

Also I know the answers:

0 2 7.052009
1 0 6.687943
2 0 6.995272
2 1 6.687943

Here is the same data as a table: rows are users 0, 1, and 2; columns are movies 0, 1, and 2; cells are ratings, with 0 marking combinations that were not rated:

     [,1] [,2] [,3]
[1,]    9    8    0
[2,]    0    4    6
[3,]    0    0    7

I use R to compute the SVD:

$d
[1] 12.514311  9.197763  2.189331

$u
          [,1]       [,2]       [,3]
[1,] 0.9318434 -0.3240669  0.1632436
[2,] 0.3380257  0.6116879 -0.7152458
[3,] 0.1319333  0.7216776  0.6795403

$v
          [,1]        [,2]       [,3]
[1,] 0.6701600 -0.31709904  0.6710691
[2,] 0.7037423 -0.01584988 -0.7102785
[3,] 0.2358650  0.94825998  0.2125341

The transpose of v is:

          [,1]        [,2]       [,3]
[1,]  0.6701600   0.7037423   0.2358650
[2,] -0.31709904 -0.01584988  0.94825998
[3,]  0.6710691  -0.7102785   0.2125341
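As a cross-check, the same decomposition can be reproduced outside R; a minimal sketch in Python/numpy (column signs in U and V may be flipped relative to R's output, which is a normal SVD ambiguity):

```python
import numpy as np

# Zero-filled rating matrix from the question: rows = users 0..2, columns = movies 0..2
A = np.array([[9.0, 8.0, 0.0],
              [0.0, 4.0, 6.0],
              [0.0, 0.0, 7.0]])

U, d, Vt = np.linalg.svd(A)

print(d)  # singular values, matching R's $d: 12.514311, 9.197763, 2.189331
# Full-rank reconstruction recovers A exactly (up to floating-point error)
print(np.allclose(U @ np.diag(d) @ Vt, A))  # True
```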

And I read about predicting movie ratings using this formula: [image of the formula, not reproduced here]

But I don't understand how to predict ratings like this:

0 2 7.052009
1 0 6.687943
2 0 6.995272
2 1 6.687943

For this data:

0 2
1 0
2 0
2 1
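The usual reading of such a formula is: truncate the SVD to rank k and take the corresponding entries of the reconstructed matrix as predictions. A sketch in Python/numpy, using the zero-filled matrix from the question (note this will not reproduce the expected answers, since the zeros are treated as real ratings; k=2 is an arbitrary choice for illustration):

```python
import numpy as np

# Zero-filled rating matrix: rows = users, columns = movies
A = np.array([[9.0, 8.0, 0.0],
              [0.0, 4.0, 6.0],
              [0.0, 0.0, 7.0]])

U, d, Vt = np.linalg.svd(A)

k = 2  # number of singular values kept
A_pred = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]

# Predicted ratings for the (user, movie) pairs from the question
for user, movie in [(0, 2), (1, 0), (2, 0), (2, 1)]:
    print(user, movie, round(A_pred[user, movie], 6))
```

By the Eckart-Young theorem, this truncation is the best rank-k approximation of the zero-filled matrix; whether it is a good *prediction* is exactly what the accepted answer disputes.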
asked Jan 08 '23 by rel1x

2 Answers

There are several things that seem incorrect to me in your example. First, when you don't have a rating available for a specific user/movie combination, you should not fill it with zero. Doing so tells SVD, or any other type of principal component analysis (PCA), that these are actual ratings (which are artificially low). Furthermore, covariances calculated on zero-filled data would be computed from an incorrect number of observations.
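The bias is easy to see on the question's own data; a small illustration in Python/numpy (variable names are mine), comparing NaN-aware statistics with zero-filling:

```python
import numpy as np

# User 0's ratings: movies 0 and 1 rated, movie 2 missing
rated = np.array([9.0, 8.0, np.nan])

zero_filled_mean = np.nan_to_num(rated).mean()  # treats the gap as a rating of 0
nan_aware_mean = np.nanmean(rated)              # uses only observed ratings

print(zero_filled_mean)  # 5.666... -- artificially low
print(nan_aware_mean)    # 8.5
```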

The Netflix Prize winner (link for more info) that utilized the SVD approach must also have used some sort of missing-data PCA routine. In that case, the missing values should not be zero but rather NaN, although I haven't seen the details of the actual approach they used.

The second question I have is whether "the answer" you provide is really based on the rather small dataset you give in the example. With a 3-user by 3-movie dataset, there are very few locations for calculating correlations between users, so any prediction will be very poor. Nevertheless, I was able to produce a result, but it doesn't match your expected answer.

The approach is called "Recursively Subtracted Empirical Orthogonal Functions" (RSEOF), which is a PCA approach specially designed to deal with missing data. That said, I wouldn't have much confidence in the predictions without a larger training dataset.

So, I started by loading your original and prediction datasets and reshaping the training data into a matrix using acast from the reshape2 package:

library(reshape2)
library(sinkr) # download from GitHub: https://github.com/menugget/sinkr

# Original data
df1 <- data.frame(user=factor(c(0,0,1,1,2)), movie=factor(c(0,1,1,2,2)), rank=c(9,8,4,6,7))
df1

# Data to predict
df2 <- data.frame(user=factor(c(0,1,2,2)), movie=factor(c(2,0,0,1)))
df2

# Re-organize data into matrix(movies=rows, users=columns)
m1 <- acast(df1, movie ~ user, fill=NaN)
m1
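For readers without reshape2, the same movie-by-user matrix can be built by hand; an equivalent sketch in Python/numpy, with NaN marking missing ratings:

```python
import numpy as np

# The (user, movie, rank) triples from the question
users  = [0, 0, 1, 1, 2]
movies = [0, 1, 1, 2, 2]
ranks  = [9.0, 8.0, 4.0, 6.0, 7.0]

# Movies as rows, users as columns, NaN for unrated cells
m1 = np.full((3, 3), np.nan)
m1[movies, users] = ranks
print(m1)
```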

Then, using the eof function of the sinkr package (link), we perform the RSEOF:

# PCA of m1 (using recursive SVD)
E <- eof(m1, method="svd", recursive=TRUE, center=FALSE, scale=FALSE)
E$u
E$A #(like "v" but with Lambda units added)
E$Lambda

Predicted values for the NaN positions in the data can be obtained by reconstructing the full matrix from the PCA information (basically E$A %*% t(E$u)):

# Reconstruct full m1 matrix using PCs
R <- eofRecon(E)
R

# Add predicted ranks to df2
pos <- (as.numeric(df2$user)-1)*length(levels(df1$movie)) + as.numeric(df2$movie)
pos
df2$rank <- R[pos]
df2
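The pos calculation works because R stores matrices column-major, so (user)*n_movies + movie is a linear index into the movie-by-user matrix. The same trick in Python/numpy requires an explicit Fortran-order flatten (R here is a stand-in matrix, not the actual eofRecon() output):

```python
import numpy as np

# A stand-in for the reconstructed movie-by-user matrix R from eofRecon()
R = np.arange(9.0).reshape(3, 3)

user, movie = 2, 1   # 0-based IDs, as in the question
n_movies = 3

# Column-major linear index, the 0-based equivalent of the R code's pos
pos = user * n_movies + movie
value = R.flatten(order="F")[pos]   # flatten column-by-column, like R

print(value == R[movie, user])  # True: same cell either way
```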

The object df2 contains the specific predicted ranks for the user/movie combinations that you specified in your prediction dataset:

  user movie     rank
1    0     2 9.246148
2    1     0 7.535567
3    2     0 6.292984
4    2     1 5.661985

I personally think these values make more sense than your expected result (all of which are around 7). For example, look at the matrix of movies (rows) by users (columns), m1:

    0   1   2
0   9 NaN NaN
1   8   4 NaN
2 NaN   6   7

I would expect user "0" to like movie "2" more than movie "1", given that this is the trend for user "1". Movie "1" is the only rating the two users have in common on which to base the prediction. Your expected value was 7.05, which would be lower than the rating for movie "1" (i.e. 8), whereas the RSEOF prediction is 9.2.

I hope this helps you out. But if your expected answer is what you are shooting for, then I would have doubts about the method used by the "truth holder". More likely, you have simply provided a smaller version of your dataset, and thus we cannot arrive at the same answer as in your smaller reproducible example.

answered Jan 18 '23 by Marc in the box


This is a classic matrix completion problem where unknown values are replaced with zeroes in the data matrix. First take a decomposition of your data matrix: for a symmetric matrix you can use the eigendecomposition, in which case U == V and A_pred = UEU^T; otherwise use the SVD, which gives A_pred = UDV^T. Here A_pred is the predicted complete version of A (your data matrix), so your predicted value of A[i][j] is simply A_pred[i][j]. Note that the matrix in the question is not symmetric, which is why the posted U and V differ.
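For the symmetric case this answer describes, the reconstruction looks like this; a sketch in Python/numpy with a made-up symmetric matrix (for a non-symmetric matrix, as in the question, use the SVD form U D V^T instead):

```python
import numpy as np

# A symmetric example matrix (invented for illustration)
A = np.array([[4.0, 2.0, 0.0],
              [2.0, 3.0, 1.0],
              [0.0, 1.0, 5.0]])

# Eigendecomposition: A = U E U^T holds because A is symmetric
E, U = np.linalg.eigh(A)
A_pred = U @ np.diag(E) @ U.T

print(np.allclose(A_pred, A))  # True -- full reconstruction recovers A
```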

answered Jan 18 '23 by vrume21