
Online learning of LDA model in Spark

Is there a way to train an LDA model in an online-learning fashion, i.e. loading a previously trained model and updating it with new documents?

asked Mar 08 '17 by mathieu

2 Answers

Answering myself: it is not possible as of now.

Actually, Spark has two optimizers for LDA model training, and one of them is OnlineLDAOptimizer. This approach is specifically designed to incrementally update the model with mini-batches of documents.

The optimizer implements the online variational Bayes LDA algorithm, which processes a subset of the corpus on each iteration and updates the term-topic distribution adaptively.

Original Online LDA paper: Hoffman, Blei and Bach, "Online Learning for Latent Dirichlet Allocation." NIPS, 2010.
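For reference, here is a minimal sketch of how the online optimizer is selected when training with the RDD-based mllib API; the parameter values (number of topics, mini-batch fraction, iteration count) are arbitrary placeholders:

```scala
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// corpus: RDD of (documentId, term-count vector)
def trainOnlineLda(corpus: RDD[(Long, Vector)]) = {
  new LDA()
    .setK(10)                          // number of topics (placeholder)
    .setMaxIterations(50)
    .setOptimizer(new OnlineLDAOptimizer()
      .setMiniBatchFraction(0.05))     // fraction of the corpus sampled per iteration
    .run(corpus)                       // with the online optimizer this returns a LocalLDAModel
}
```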

Unfortunately, the current mllib API does not allow loading a previously trained LDA model and feeding it an additional batch of documents.

Some mllib models support an initialModel as a starting point for incremental updates (see KMeans or GMM), but LDA does not currently support that. I filed a JIRA for it: SPARK-20082. Please upvote ;-)
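To illustrate the contrast, here is a hedged sketch of what warm-starting looks like with mllib's KMeans via setInitialModel; the model path is a placeholder, and no equivalent hook exists on LDA today:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Resume training from a previously saved model (the path is a placeholder).
def updateKMeans(sc: SparkContext, newData: RDD[Vector]): KMeansModel = {
  val previous = KMeansModel.load(sc, "hdfs:///models/kmeans-previous")
  new KMeans()
    .setK(previous.k)
    .setInitialModel(previous)   // warm start: reuse centers from the previous model
    .run(newData)
}
```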

For the record, there is also a JIRA for streaming LDA: SPARK-8696.

answered Oct 13 '22 by mathieu


I don't think such a thing exists. LDA is a probabilistic parameter estimation algorithm (a very simplified explanation of the process: LDA explained), and adding even a few documents would change all previously computed probabilities, so the model effectively has to be recomputed.

I don't know your use case, but if your model converges in a reasonable time, you could consider updating it in batches: retrain periodically and discard some of the oldest documents at each re-computation to keep the estimation fast, as sketched below.
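As a rough sketch of that idea, assuming each document carries a timestamp, one could trim the corpus to a recent window before each full retraining; the names, schema, and window size below are illustrative:

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// docs: (documentId, (timestampMs, termCounts)) -- schema assumed for illustration
def retrainOnRecentWindow(docs: RDD[(Long, (Long, Vector))],
                          nowMs: Long,
                          windowMs: Long = 30L * 24 * 3600 * 1000) = {
  val recent = docs
    .filter { case (_, (ts, _)) => nowMs - ts <= windowMs }  // discard the oldest documents
    .map { case (id, (_, counts)) => (id, counts) }
  new LDA().setK(10).setMaxIterations(50).run(recent)        // recompute the model from scratch
}
```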

answered Oct 13 '22 by ML_TN