I wanted to try out Spark for collaborative filtering using MLlib, as explained in this tutorial: https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html The algorithm performs matrix factorization and is based on the paper "Collaborative Filtering for Implicit Feedback Datasets".
Everything is up and running using the 10 million MovieLens data set. The data set is split into 80% training, 10% test, and 10% validation, values similar to the tutorial's, although I used different training parameters.
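For reference, the split itself is a one-liner; a minimal sketch, assuming a spark-shell SparkContext `sc` and the standard userId::movieId::rating::timestamp format of the MovieLens ratings.dat file:

```scala
import org.apache.spark.mllib.recommendation.Rating

// Parse the MovieLens 10M ratings file (userId::movieId::rating::timestamp).
val ratings = sc.textFile("ratings.dat").map { line =>
  val fields = line.split("::")
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}

// 80% training, 10% validation, 10% test; fixing the seed makes the split reproducible.
val Array(training, validation, test) =
  ratings.randomSplit(Array(0.8, 0.1, 0.1), seed = 42L)
```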
I tried running the algorithm several times and always got recommendations that don't make any sense to me. Even when rating only kids' movies, I get the following results:
For ratings:
Results:
Movies recommended for you:
Except for Only Yesterday, none of these seem to make any sense.
If anyone out there knows how to interpret these results or how to get better ones, I would really appreciate you sharing your knowledge.
Best regards
EDIT:
As suggested, I trained another model with more factors:
And different personal ratings:
The recommended movies are:
Movies recommended for you:
Not one useful result.
EDIT2: Using the implicit feedback method, I get much better results! With the same action movies as above, the recommendations are:
Movies recommended for you:
That's more like what I expected! The question is why the explicit version is so very bad.
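For reference, the only code change between the two runs is the training call; a rough sketch in the RDD-based MLlib API (the parameter values here are illustrative, not the exact ones I used):

```scala
import org.apache.spark.mllib.recommendation.ALS

// Explicit-feedback variant (what I used originally):
val explicitModel = ALS.train(training, rank = 10, iterations = 10, lambda = 0.01)

// Implicit-feedback variant from the Hu/Koren/Volinsky paper; alpha scales
// the confidence attached to each observation.
val implicitModel = ALS.trainImplicit(training, rank = 10, iterations = 10,
  lambda = 0.01, alpha = 1.0)
```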
ALS is implemented in Apache Spark ML and built for large-scale collaborative filtering problems. It does a good job of handling the scalability and sparseness of the ratings data, it's simple, and it scales well to very large datasets.
We then train an ALS model which assumes, by default, that the ratings are explicit (implicitPrefs is false). We evaluate the recommendation model by measuring the root-mean-square error of rating prediction. Refer to the ALS Scala docs for more details on the API.
Recommendation using Alternating Least Squares (ALS): the general approach is iterative. During each iteration, one of the factor matrices is held constant while the other is solved for using least squares. The newly solved factor matrix is then held constant while solving for the other factor matrix.
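To make the alternation concrete, here is a toy, purely illustrative single sweep on a dense ratings matrix using Breeze; Spark's actual ALS works on sparse data and distributes these per-row solves:

```scala
import breeze.linalg._

// One ALS sweep on a dense rating matrix R (users x items) with factor
// matrices U (users x k) and V (items x k). Toy code only.
def alsSweep(R: DenseMatrix[Double],
             U: DenseMatrix[Double],
             V: DenseMatrix[Double],
             lambda: Double): (DenseMatrix[Double], DenseMatrix[Double]) = {
  val k = U.cols
  val reg = DenseMatrix.eye[Double](k) * lambda

  // Hold V fixed; for each user u solve (V^T V + lambda I) u_u = V^T r_u.
  val newU = DenseMatrix.zeros[Double](U.rows, k)
  for (u <- 0 until U.rows)
    newU(u, ::) := ((V.t * V + reg) \ (V.t * R(u, ::).t)).t

  // Hold U fixed; for each item i solve (U^T U + lambda I) v_i = U^T r_i.
  val newV = DenseMatrix.zeros[Double](V.rows, k)
  for (i <- 0 until V.rows)
    newV(i, ::) := ((newU.t * newU + reg) \ (newU.t * R(::, i))).t

  (newU, newV)
}
```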
rank is the number of latent factors in the model (defaults to 10). maxIter is the maximum number of iterations to run (defaults to 10). regParam specifies the regularization parameter in ALS (defaults to 1.0).
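Putting those parameters together, a minimal sketch of the DataFrame-based API (the userId/movieId/rating column names and the parameter values are assumptions, not your actual setup):

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS

// Assumes `training` and `test` are DataFrames with userId, movieId, rating columns.
val als = new ALS()
  .setRank(10)      // number of latent factors
  .setMaxIter(10)   // ALS iterations
  .setRegParam(0.1) // regularization parameter; illustrative value
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
// implicitPrefs defaults to false, i.e. explicit ratings.

val model = als.fit(training)

// Drop NaN predictions for users/items unseen in training (Spark 2.2+).
model.setColdStartStrategy("drop")
val predictions = model.transform(test)

val rmse = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")
  .evaluate(predictions)
println(s"Root-mean-square error = $rmse")
```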
Note that the code you are running does not use implicit feedback, so it is not quite the algorithm from the paper you refer to. Just make sure you are not using ALS.trainImplicit. You may also need different lambda and rank values. An RMSE of 0.88 is "OK" for this data set; I am not clear whether the example's values are optimal or just the ones the toy test happened to produce, and you are using different values still. Maybe it's simply not optimal yet.
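If tuning is the issue, a small grid search over rank and lambda against the validation split, in the spirit of the tutorial, would look roughly like this (the computeRmse helper and the candidate grids are illustrative, not from your code):

```scala
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// RMSE of a model's predictions against held-out ratings.
def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating]): Double = {
  val predictions = model
    .predict(data.map(r => (r.user, r.product)))
    .map(p => ((p.user, p.product), p.rating))
  val ratingsAndPreds = data
    .map(r => ((r.user, r.product), r.rating))
    .join(predictions)
    .values
  math.sqrt(ratingsAndPreds.map { case (r, p) => (r - p) * (r - p) }.mean())
}

// Pick the (rank, lambda) pair with the lowest validation RMSE.
val candidates = for {
  rank   <- Seq(8, 12, 20)
  lambda <- Seq(0.01, 0.1, 1.0)
} yield {
  val model = ALS.train(training, rank, 10, lambda)
  (rank, lambda, computeRmse(model, validation))
}
val (bestRank, bestLambda, bestRmse) = candidates.minBy(_._3)
```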
It could even be something like bugs in the ALS implementation that have since been fixed. Try comparing to another implementation of ALS if you can.
I always try to resist rationalizing recommendations, since our brains inevitably find some explanation even for random ones. But, hey, I can say that you did not get action, horror, crime dramas, or thrillers here. I find that a taste for kids' movies goes hand in hand with a taste for arty movies: the kind of people who filled out their tastes on MovieLens way back when and rated kids' movies were not actually kids but parents, and software-engineer types old enough to have kids do tend to watch these sorts of foreign films.