I am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations here, but I couldn't find how to then use the model to find the topic distribution of a new, unseen document.
As of Spark 1.5, this functionality has not been implemented for the DistributedLDAModel. What you're going to need to do is convert your model to a LocalLDAModel using the toLocal method, and then call the topicDistributions(documents: RDD[(Long, Vector)]) method, where documents are the new (i.e. out-of-training) documents, something like this:
val newDocuments: RDD[(Long, Vector)] = ...
val topicDistributions = distLDA.toLocal.topicDistributions(newDocuments)
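For completeness, here is a minimal end-to-end sketch of that flow. It assumes an existing SparkContext sc and uses toy term-count vectors; in practice your documents would come from your own vocabulary/count pipeline, and unseen documents must be vectorized against the same vocabulary as the training set:

import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Toy (docId, term-count vector) pairs; assumes an existing SparkContext `sc`.
val trainingDocs: RDD[(Long, Vector)] = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0, 5.0)),
  (1L, Vectors.dense(0.0, 1.0, 3.0, 1.0))
))

// The default "em" optimizer produces a DistributedLDAModel.
val distLDA = new LDA()
  .setK(2)
  .setMaxIterations(50)
  .run(trainingDocs)
  .asInstanceOf[DistributedLDAModel]

// Convert to a LocalLDAModel and infer topic mixtures for unseen documents,
// which must use the same vector indices (vocabulary) as the training data.
val newDocuments: RDD[(Long, Vector)] = sc.parallelize(Seq(
  (0L, Vectors.dense(2.0, 0.0, 1.0, 0.0))
))
val topicDistributions: RDD[(Long, Vector)] =
  distLDA.toLocal.topicDistributions(newDocuments)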
This is going to be less accurate than the EM inference algorithm that this paper suggests, but it will work. Alternatively, you could just use the new online variational Bayes training algorithm, which already results in a LocalLDAModel. In addition to being faster, this new algorithm is also preferable because, unlike the older EM algorithm for fitting DistributedLDAModels, it optimizes the parameters (alphas) of the Dirichlet prior over the per-document topic mixing weights. According to Wallach et al., optimizing the alphas is quite important for obtaining good topics.