How to update Spark MatrixFactorizationModel for ALS

I built a simple recommendation system for the MovieLens dataset, inspired by https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html.

I also have problems with explicit training, as described in "Apache Spark ALS collaborative filtering results. They don't make sense". Using implicit training (on both explicit and implicit data) gives me reasonable results, but explicit training doesn't.

While this is OK for me for now, I'm curious how to update a model. My current solution works like this:

having all user ratings
generate model
get recommendations for user

I want to have a flow like this:

having a base of ratings
generate model once (optionally save and load it)
get some ratings by one user on 10 random movies (not in the model!)
get recommendations using the model and the new user ratings

Therefore I must update my model without completely recomputing it. Is there any way to do so?

While the first way is good for batch processing (like generating recommendations in nightly batches), the second way would be good for generating recommendations in near real time.

asked May 28 '15 14:05

mniehoff

1 Answer

Edit: the following worked for me because I had implicit feedback ratings and was only interested in ranking the products for a new user. More details here

You can actually get predictions for new users using the trained model (without updating it):

To get predictions for a user who is in the model, you take the user's latent representation (a vector u of size f, the number of factors) and multiply it by the product latent factor matrix (the latent representations of all products, a set of vectors of size f). This gives you a score for each product. For a new user, the problem is that you don't have a latent representation; you only have the full representation of size M (the number of distinct products). What you can do is use a similarity function to compute an approximate latent representation for the new user: multiply the full representation by the transpose of the product matrix.

That is, if your user latent matrix is u and your product latent matrix is v, then for a user i in the model you get scores by computing u_i * v. For a new user, you don't have a latent representation, so take the full representation full_u and compute full_u * v^t * v. This approximates the latent factors for the new user and should give reasonable recommendations (if the model already gives reasonable recommendations for existing users).
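The product full_u * v^t * v can be sketched in plain Python as follows. This is only an illustration under stated assumptions: v is stored as f rows of M columns, full_u holds 0 for unrated products, and the function names (fold_in_scores, top_unseen) and the toy numbers are made up for this example and are not part of Spark.

```python
# Fold-in sketch: score all products for a user the model has never seen.
# v is the product latent factor matrix, f rows by M columns (assumed layout).
# full_u is the new user's rating vector of length M (0 where nothing was rated).

def fold_in_scores(full_u, v):
    """Return full_u * v^t * v as a list of M scores."""
    f = len(v)
    m = len(full_u)
    # Step 1: approximate latent vector, u_new = full_u * v^t (length f).
    u_new = [sum(full_u[j] * v[k][j] for j in range(m)) for k in range(f)]
    # Step 2: scores = u_new * v (length M), computed as for a known user.
    return [sum(u_new[k] * v[k][j] for k in range(f)) for j in range(m)]


def top_unseen(full_u, v, n):
    """Return the column indices of the n best-scoring products not yet rated."""
    scores = fold_in_scores(full_u, v)
    unseen = [j for j, rating in enumerate(full_u) if rating == 0]
    return sorted(unseen, key=lambda j: scores[j], reverse=True)[:n]


# Toy data, invented for illustration: 2 factors, 4 products.
v = [
    [1.0, 0.9, 0.0, 0.1],
    [0.0, 0.1, 1.0, 0.9],
]
full_u = [1.0, 0.0, 0.0, 0.0]  # the new user interacted with product 0 only

print(top_unseen(full_u, v, 2))  # prints [1, 3]
```

Product 1 ranks first because its factors are closest to those of product 0, the only item the new user interacted with. As noted in the edit above, this gives a ranking rather than calibrated rating predictions.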

To answer the question about training: this lets you compute predictions for new users without repeating the heavy model computation, which you now only need to do once in a while. So you can run your batch processing at night and still make predictions for new users during the day.

Note: MLlib gives you access to the matrices u and v (userFeatures and productFeatures on MatrixFactorizationModel).
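MLlib exposes the product factors as (product id, factor array) pairs rather than as a dense matrix, so some assembly is needed before the multiplication above. The sketch below assumes the pairs have already been brought to the driver as a Python list (for example via something like model.productFeatures().collect() in PySpark, which is only practical for a modest number of products). The helper names and sample values are hypothetical.

```python
# Assemble v (f rows by M columns) and an aligned full_u vector from
# (product_id, factors) pairs. The pairs below are invented sample data;
# in practice they would come from the trained model.

def build_product_matrix(product_features):
    """Return (ids, v) where ids[j] is the product id of column j of v."""
    pairs = sorted(product_features, key=lambda p: p[0])  # stable column order
    ids = [pid for pid, _ in pairs]
    f = len(pairs[0][1])
    v = [[factors[k] for _, factors in pairs] for k in range(f)]
    return ids, v


def build_full_u(new_ratings, ids):
    """Map {product_id: rating} onto the column order of v, 0.0 if unrated.

    Ratings for product ids the model does not know are silently dropped,
    because there is no latent vector to multiply them with.
    """
    return [float(new_ratings.get(pid, 0.0)) for pid in ids]


product_features = [(20, [0.9, 0.1]), (10, [1.0, 0.0]), (30, [0.0, 1.0])]
ids, v = build_product_matrix(product_features)
full_u = build_full_u({20: 4, 99: 5}, ids)  # product 99 is not in the model

print(ids)     # prints [10, 20, 30]
print(full_u)  # prints [0.0, 4.0, 0.0]
```

Keeping ids alongside v lets you translate the column indices of the resulting scores back to real product ids.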

answered Sep 17 '22 16:09

yoh.lej

Tags: machine-learning, apache-spark, apache-spark-mllib, collaborative-filtering
