Incremental training of ALS model

1 Answer

I imagine you are using Spark MLlib's ALS model, which performs matrix factorization. The result of the model is two matrices: a user-features matrix and an item-features matrix.
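
To make the two matrices concrete, here is a minimal pure-Python sketch. The dictionaries `user_features` and `item_features`, the rank of 2, and every number in them are made-up stand-ins for illustration, not output from Spark. In MLlib the trained `MatrixFactorizationModel` exposes the corresponding factors as `userFeatures` and `productFeatures`.

```python
# Toy stand-ins for the two factor matrices (rank 2). All values are made up.
user_features = {
    1: [0.9, 0.1],
    2: [0.2, 0.8],
}
item_features = {
    10: [1.0, 0.0],
    20: [0.0, 1.0],
}

def predict(user_id, item_id):
    """Predicted rating = dot product of the user and item factor vectors."""
    u = user_features[user_id]
    v = item_features[item_id]
    return sum(a * b for a, b in zip(u, v))

# User 1's factors line up with item 10 much more than with item 20.
print(predict(1, 10))
print(predict(1, 20))
```

Updating the model incrementally therefore comes down to changing individual rows of these two matrices.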

Assume we receive a stream of ratings (or transactions, in the implicit-feedback case). A truly (100%) online update of this model would update both matrices for every new rating by triggering a full retrain of the ALS model on the entire data set plus the new rating. In this scenario one is limited by the fact that running the full ALS model is computationally expensive, and the incoming stream could be frequent enough to trigger a full retrain far too often.

Knowing this, we can look for alternatives. A single rating should not change the matrices much, and there are optimization approaches that are incremental, for example SGD. There is an interesting (still experimental) library, written for the explicit-ratings case, which does incremental updates for each batch of a DStream:

https://github.com/brkyvz/streaming-matrix-factorization

The idea behind an incremental approach such as SGD is that, in a minimization problem, a small enough step against the gradient reduces the error. So even if we update on the single new rating only, touching only this user's row of the user-features matrix and only the rated item's row of the item-features matrix, a step in the negative gradient direction moves us towards a minimum of the error function. It is of course an approximation, but it still moves towards the minimum.
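
A minimal sketch of one such step, in pure Python. It assumes a squared-error loss with L2 regularization; the function name `sgd_update`, the learning rate `lr`, the regularization weight `reg`, and the starting vectors are all illustrative choices, not values taken from Spark or from the library linked above.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgd_update(u, v, rating, lr=0.05, reg=0.01):
    """One SGD step for a single (user, item, rating) observation.

    Assumed loss: 0.5 * (rating - u.v)^2 + 0.5 * reg * (|u|^2 + |v|^2).
    Returns new copies of the user vector u and the item vector v.
    """
    err = rating - dot(u, v)
    # Step against the gradient: d(loss)/du = -err * v + reg * u (same for v).
    new_u = [ui + lr * (err * vi - reg * ui) for ui, vi in zip(u, v)]
    new_v = [vi + lr * (err * ui - reg * vi) for ui, vi in zip(u, v)]
    return new_u, new_v

u = [0.1, 0.3]   # hypothetical factors of the user who rated
v = [0.2, 0.4]   # hypothetical factors of the item that was rated
rating = 4.0

before = abs(rating - dot(u, v))
u, v = sgd_update(u, v, rating)
after = abs(rating - dot(u, v))
print(before, after)  # the error on this rating shrinks after the step
```

Only one row of each matrix changes, which is what makes the update cheap compared with a full ALS retrain.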

The other problem comes from Spark itself and its distributed nature. Ideally the updates should be applied sequentially, one per incoming rating, but Spark treats the incoming stream as batches, each distributed as an RDD, so the update operations run over the entire batch with no guarantee of ordering.
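
Why ordering matters can be seen in a small pure-Python sketch: the same two ratings from one user, applied in opposite orders, leave that user's vector in slightly different states. The `sgd_update` helper, the item ids "A" and "B", and all numbers are illustrative assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgd_update(u, v, rating, lr=0.05, reg=0.01):
    """One SGD step on the regularized squared error for a single rating."""
    err = rating - dot(u, v)
    new_u = [ui + lr * (err * vi - reg * ui) for ui, vi in zip(u, v)]
    new_v = [vi + lr * (err * ui - reg * vi) for ui, vi in zip(u, v)]
    return new_u, new_v

def apply_in_order(ratings):
    """Apply the updates one by one and return the final user vector."""
    u = [0.1, 0.3]                              # the shared user's starting factors
    items = {"A": [0.2, 0.4], "B": [0.5, 0.1]}  # two items rated by that user
    for item_id, rating in ratings:
        u, items[item_id] = sgd_update(u, items[item_id], rating)
    return u

u_ab = apply_in_order([("A", 5.0), ("B", 1.0)])
u_ba = apply_in_order([("B", 1.0), ("A", 5.0)])
print(u_ab)
print(u_ba)  # close to u_ab, but not identical
```

The difference is small here, which fits the point that a single rating should not move the matrices much, but it shows that a batch-level update is an approximation of the strictly sequential one.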

In more detail: if you are using Prediction.IO, for example, you could do offline training with the built-in train and deploy functions. If you want online updates, though, you will have to access both matrices for each batch of the stream, run the updates using SGD, and then ask for the new model to be deployed. This functionality is of course not in Prediction.IO; you would have to build it on your own.
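
As a rough sketch of what that custom per-batch step could look like, here is a pure-Python version that keeps both factor matrices as dictionaries and applies one SGD update per rating in a batch. Everything in it is hypothetical: `apply_batch`, `batch_error`, the random initialization of unseen users and items, and the hyperparameters. It does not use any Prediction.IO or Spark API; in a real setup the dictionaries would be loaded from, and written back to, the deployed model.

```python
import random

_rng = random.Random(0)  # fixed seed so the sketch is reproducible

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def apply_batch(user_features, item_features, batch, rank=2, lr=0.05, reg=0.01):
    """Update both factor dictionaries in place from (user, item, rating) tuples."""
    for user_id, item_id, rating in batch:
        # Unseen users/items start from small random factors (an assumption).
        u = user_features.setdefault(
            user_id, [_rng.uniform(0.0, 0.1) for _ in range(rank)])
        v = item_features.setdefault(
            item_id, [_rng.uniform(0.0, 0.1) for _ in range(rank)])
        err = rating - dot(u, v)
        # Both new vectors are computed from the old u and v before storing.
        user_features[user_id] = [ui + lr * (err * vi - reg * ui) for ui, vi in zip(u, v)]
        item_features[item_id] = [vi + lr * (err * ui - reg * vi) for ui, vi in zip(u, v)]

def batch_error(user_features, item_features, batch):
    """Sum of squared errors of the current factors on a batch."""
    return sum((r - dot(user_features[u], item_features[i])) ** 2
               for u, i, r in batch)

user_features = {1: [0.1, 0.3]}    # factors from an earlier offline training run
item_features = {10: [0.2, 0.4]}
batch = [(1, 10, 4.0), (2, 10, 3.0), (1, 20, 1.0)]  # user 2 and item 20 are new

apply_batch(user_features, item_features, batch)  # also creates the new rows
before = batch_error(user_features, item_features, batch)
apply_batch(user_features, item_features, batch)
after = batch_error(user_features, item_features, batch)
print(before, after)  # the error on the batch goes down
```

After each batch the updated dictionaries would have to be pushed to whatever serves predictions, which is the "deploy" part you would need to build yourself.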

Interesting notes for SGD updates:

http://stanford.edu/~rezab/classes/cme323/S15/notes/lec14.pdf

171

answered Oct 17 '22 13:10

Dr VComas

Tags:

machine-learning

apache-spark

prediction

apache-spark-mllib

predictionio

Wouter
