I'm participating in a programming contest where I have data in which the first column is a user, the second column is a movie, and the third is a rating on a ten-point scale.
0 0 9
0 1 8
1 1 4
1 2 6
2 2 7
And I have to predict the third column (user, movie, ?):
0 2
1 0
2 0
2 1
Also I know the answers:
0 2 7.052009
1 0 6.687943
2 0 6.995272
2 1 6.687943
Here is this data as a table: rows are users 0, 1 and 2; columns are movies 0, 1 and 2; cells are ratings, with 0 for movies that were not voted on:
[,1] [,2] [,3]
[1,] 9 8 0
[2,] 0 4 6
[3,] 0 0 7
I use R to get the SVD:
$d
[1] 12.514311 9.197763 2.189331
$u
[,1] [,2] [,3]
[1,] 0.9318434 -0.3240669 0.1632436
[2,] 0.3380257 0.6116879 -0.7152458
[3,] 0.1319333 0.7216776 0.6795403
$v
[,1] [,2] [,3]
[1,] 0.6701600 -0.31709904 0.6710691
[2,] 0.7037423 -0.01584988 -0.7102785
[3,] 0.2358650 0.94825998 0.2125341
The transpose of v is:
[,1] [,2] [,3]
[1,] 0.6701600 0.7037423 0.2358650
[2,] -0.31709904 -0.01584988 0.94825998
[3,] 0.6710691 -0.7102785 0.2125341
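As a sanity check, the same decomposition can be reproduced outside R, e.g. with numpy (a sketch; note that numpy returns vᵀ directly, and the signs of individual singular vectors may differ from R's output while the singular values agree):

```python
import numpy as np

# Zero-filled rating matrix (rows = users, columns = movies)
M = np.array([[9.0, 8.0, 0.0],
              [0.0, 4.0, 6.0],
              [0.0, 0.0, 7.0]])

# numpy returns v already transposed (vt)
u, d, vt = np.linalg.svd(M)

print(d)  # singular values, matching R's $d

# u %*% diag(d) %*% t(v) reconstructs the original matrix
assert np.allclose(u @ np.diag(d) @ vt, M)
```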
And I read about predicting movie ratings with a formula based on this decomposition, but I don't understand how to predict ratings like this:
0 2 7.052009
1 0 6.687943
2 0 6.995272
2 1 6.687943
For this data:
0 2
1 0
2 0
2 1
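The formula usually described for SVD-based rating prediction is the truncated reconstruction: keep the top K singular values and predict the rating of user i for movie j as the reconstructed entry. A sketch with numpy (K = 2 is an arbitrary choice here, and with zero-filled data the results will not match the expected answers above):

```python
import numpy as np

# Zero-filled rating matrix (rows = users, columns = movies)
M = np.array([[9.0, 8.0, 0.0],
              [0.0, 4.0, 6.0],
              [0.0, 0.0, 7.0]])

u, d, vt = np.linalg.svd(M)

K = 2  # number of latent factors kept (arbitrary choice for this sketch)
M_hat = u[:, :K] @ np.diag(d[:K]) @ vt[:K, :]

# Predicted rating for (user i, movie j) is the reconstructed entry M_hat[i, j]
for i, j in [(0, 2), (1, 0), (2, 0), (2, 1)]:
    print(i, j, M_hat[i, j])
```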
There are several things in your example that seem incorrect to me. First, when you don't have a rating available for a specific user/movie combination, you should not fill it with zero. Doing so would tell SVD, or any other type of principal component analysis (PCA), that these are the actual ranks (which are artificially low). Furthermore, covariances calculated from zero-filled data would be computed based on an incorrect number of observations.
The Netflix prize winner (link for more info) that utilized the SVD approach must also have used some sort of missing-data PCA routine. In that case, the missing values should not be zero, but rather NaN, although I haven't seen the details of the actual approach that they used.
The second question I have is whether "the answer" that you provide is really based on the rather small dataset you give in the example. Given the 3-user by 3-movie dataset, there are very few locations for calculating correlations between users, so any prediction will be very poor. Nevertheless, I was able to produce a result, but it doesn't match your expected answer.
The approach is called "Recursively Subtracted Empirical Orthogonal Functions" (RSEOF), a PCA approach specially designed to deal with missing data. That said, I wouldn't have much confidence in the predictions without a larger training dataset.
So, I started by loading in your original and prediction datasets, and reshaped the training data into a matrix using acast from the reshape2 package:
library(reshape2)
library(sinkr) # download from GitHub: https://github.com/menugget/sinkr
# Original data
df1 <- data.frame(user=factor(c(0,0,1,1,2)), movie=factor(c(0,1,1,2,2)), rank=c(9,8,4,6,7))
df1
# Data to predict
df2 <-data.frame(user=factor(c(0,1,2,2)), movie=factor(c(2,0,0,1)))
df2
# Re-organize data into matrix(movies=rows, users=columns)
m1 <- acast(df1, movie ~ user, fill=NaN)
m1
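For readers more familiar with Python, the same reshape can be sketched with pandas (pivot plays the role of acast here; missing user/movie combinations become NaN automatically):

```python
import numpy as np
import pandas as pd

# Original training data (same content as df1 in the R code)
df1 = pd.DataFrame({"user": [0, 0, 1, 1, 2],
                    "movie": [0, 1, 1, 2, 2],
                    "rank": [9, 8, 4, 6, 7]})

# movies = rows, users = columns; unobserved pairs become NaN
m1 = df1.pivot(index="movie", columns="user", values="rank")
print(m1)
```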
Then, using the eof function of the sinkr package (link), we perform the RSEOF:
# PCA of m1 (using recursive SVD)
E <- eof(m1, method="svd", recursive=TRUE, center=FALSE, scale=FALSE)
E$u
E$A #(like "v" but with Lambda units added)
E$Lambda
Predicted values for the NaN positions in the data can be obtained by reconstructing the full matrix from the PCA information (basically E$A %*% t(E$u)):
# Reconstruct full m1 matrix using PCs
R <- eofRecon(E)
R
# Add predicted ranks to df2
pos <- (as.numeric(df2$user)-1)*length(levels(df1$movie)) + as.numeric(df2$movie)
pos
df2$rank <- R[pos]
df2
The object df2 now contains the specific predicted ranks for the user/movie combinations that you specified in your prediction dataset:
user movie rank
1 0 2 9.246148
2 1 0 7.535567
3 2 0 6.292984
4 2 1 5.661985
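The pos arithmetic in the R code relies on R's column-major linear indexing of matrices. The same lookup can be sketched with plain 2-D indexing; the matrix values below are stand-ins copied from the output above (observed ranks filled in by hand), since the real numbers come from eofRecon:

```python
import numpy as np

# Reconstructed matrix R (movies = rows, users = columns); predicted values
# copied from the RSEOF output above, observed ranks filled in as stand-ins
R = np.array([[9.0,      7.535567, 6.292984],
              [8.0,      4.0,      5.661985],
              [9.246148, 6.0,      7.0]])

# (user, movie) pairs to predict, as in df2
pairs = [(0, 2), (1, 0), (2, 0), (2, 1)]
preds = [R[movie, user] for user, movie in pairs]
print(preds)  # [9.246148, 7.535567, 6.292984, 5.661985]
```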
I personally think that these values make more sense than your expected result (all around 7). For example, look at the matrix m1 of movies (rows) by users (columns):
0 1 2
0 9 NaN NaN
1 8 4 NaN
2 NaN 6 7
I would expect that user "0" would like movie "2" more than movie "1", given that this is the trend for user "1" (movie "1" is the only ranking the two users have in common on which to base a prediction). Your expected value was 7.05, which would be lower than the rank for movie "1" (i.e. 8), whereas the RSEOF prediction is 9.2.
I hope this helps you out. But if your expected answer is what you are shooting for, then I would have doubts about the method used by the "truth holder". It is more likely that you have simply provided a smaller version of your dataset, in which case we are not going to arrive at the same answer as in your smaller reproducible example.
This is a classic matrix completion problem. One simple approach is to replace the unknown values with zeros in the data matrix A and take its SVD, A = U E V^T (for a symmetric matrix this reduces to the eigendecomposition, with U == V; note that your example matrix is not symmetric, so U != V there). Truncating to the leading singular values gives A_pred = U E V^T as a low-rank predicted version of A, and your predicted value of A[i][j] is simply A_pred[i][j].
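The U == V claim only holds when the matrix really is symmetric (and positive semi-definite), which can be checked on a hypothetical symmetric example (not the contest data, which is not symmetric):

```python
import numpy as np

# Hypothetical symmetric positive-definite matrix (not the contest data)
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# Eigendecomposition A = Q diag(w) Q^T reconstructs A exactly
w, Q = np.linalg.eigh(A)
assert np.allclose(Q @ np.diag(w) @ Q.T, A)

# For such a matrix the SVD coincides with it: U == V up to column signs
u, s, vt = np.linalg.svd(A)
assert np.allclose(np.abs(u), np.abs(vt.T))
```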