
What is the difference between LDA and NTM in Amazon Sagemaker for Topic Modeling?

I am looking for the difference between LDA and NTM. What are some use cases where you would use LDA over NTM?

As per the AWS docs:

LDA : The Amazon SageMaker Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. LDA is most commonly used to discover a user-specified number of topics shared by documents within a text corpus.

Although you can use both the Amazon SageMaker NTM and LDA algorithms for topic modeling, they are distinct algorithms and can be expected to produce different results on the same input data.

Saurabh asked Mar 04 '23 02:03



1 Answer

LDA and NTM rest on different modeling assumptions:

SageMaker LDA (Latent Dirichlet Allocation, not to be confused with Linear Discriminant Analysis) works by assuming that documents are formed by sampling words from a finite set of topics. It has two moving parts: (1) the word composition per topic and (2) the topic composition per document.
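To make the two moving parts concrete, here is a minimal NumPy sketch of the generative assumption behind LDA. It is only an illustration, not SageMaker's implementation: the toy vocabulary, the Dirichlet concentration of 0.5, and the names `topic_word`, `doc_topic` and `sample_document` are all made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and sizes (hypothetical, for illustration only).
vocab = ["cloud", "server", "gpu", "goal", "match", "team"]
num_topics = 2
num_docs = 3
doc_length = 8

# (1) Word composition per topic: one distribution over the vocabulary per topic.
topic_word = rng.dirichlet(alpha=np.full(len(vocab), 0.5), size=num_topics)

# (2) Topic composition per document: one distribution over topics per document.
doc_topic = rng.dirichlet(alpha=np.full(num_topics, 0.5), size=num_docs)


def sample_document(topic_mix, length):
    """For each word position, pick a topic, then pick a word from that topic."""
    words = []
    for _ in range(length):
        topic = rng.choice(num_topics, p=topic_mix)
        word_id = rng.choice(len(vocab), p=topic_word[topic])
        words.append(vocab[word_id])
    return words


documents = [sample_document(doc_topic[d], doc_length) for d in range(num_docs)]
for d, words in enumerate(documents):
    print(d, np.round(doc_topic[d], 2), " ".join(words))
```

Training LDA is the reverse problem: given only the documents, estimate `topic_word` and `doc_topic`.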

SageMaker NTM, on the other hand, doesn't explicitly learn a word distribution per topic. It is a neural network that passes each document through a bottleneck layer and tries to reproduce the input document (presumably a Variational Auto Encoder (VAE), according to the AWS documentation). That means the bottleneck layer ends up containing all the information needed to predict document composition, and its coefficients can be considered as topics.
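The bottleneck idea can be sketched as a single forward pass of a tiny VAE-style autoencoder over a bag-of-words vector. This is an assumption-laden toy, not the SageMaker NTM architecture: the layer sizes, the `tanh` encoder, the softmax bottleneck and all weight names are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size, num_topics = 6, 4, 2  # toy sizes (hypothetical)


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


# Randomly initialised weights; a real model would learn them by training.
W_enc = rng.normal(scale=0.1, size=(vocab_size, hidden_size))
W_mu = rng.normal(scale=0.1, size=(hidden_size, num_topics))
W_logvar = rng.normal(scale=0.1, size=(hidden_size, num_topics))
W_dec = rng.normal(scale=0.1, size=(num_topics, vocab_size))


def forward(bow):
    """Encode word counts into a small bottleneck, then reconstruct them."""
    h = np.tanh(bow @ W_enc)
    mu, logvar = h @ W_mu, h @ W_logvar
    # Reparameterisation trick: sample the latent code from N(mu, sigma^2).
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    theta = softmax(z)              # bottleneck: per-document topic proportions
    recon = softmax(theta @ W_dec)  # reconstructed distribution over words
    return theta, recon


bow = np.array([[3.0, 2.0, 1.0, 0.0, 0.0, 0.0]])  # word counts of one document
theta, recon = forward(bow)

# Reconstruction term of the loss: negative log-likelihood of the observed counts.
recon_loss = -(bow * np.log(recon)).sum()
print(theta, recon_loss)
```

In this sketch each row of `W_dec` scores every vocabulary word for one latent dimension, which is the sense in which the coefficients can be read as topics after training.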

Here are considerations for choosing one or the other:

  1. VAE-based methods such as SageMaker NTM may do a better job of discerning relevant topics than LDA, presumably because of their possibly greater expressive power. A benchmark here (featuring a VAE-NTM that could be different from SageMaker NTM) shows that NTMs can beat LDA on both topic coherence and perplexity.
  2. So far there seems to be more community knowledge about LDA than about VAEs, NTMs and SageMaker NTM. That means a possibly easier learning and troubleshooting path if you work with LDA. Things change fast though, so this point may become less relevant as deep learning knowledge grows.
  3. SageMaker NTM has more flexible hardware options than SageMaker LDA and may scale better: SageMaker NTM can run on CPU, GPU and multi-GPU instances, as well as in a multi-instance context. For example, the official NTM demo uses an ephemeral cluster of 2 ml.c4.xlarge instances. SageMaker LDA currently only supports single-instance CPU training.
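For readers unfamiliar with the perplexity metric mentioned in point 1, here is a minimal sketch of how it is commonly computed from a topic model's outputs: the exponential of the negative average per-word log-likelihood, where lower is better. The function name and the toy inputs are hypothetical, and the sketch assumes every word with a nonzero count has a strictly positive predicted probability.

```python
import numpy as np


def perplexity(bow, doc_topic, topic_word):
    """exp(-average per-word log-likelihood); lower means a better fit.

    bow:        (docs, vocab) word counts
    doc_topic:  (docs, topics) topic proportions per document
    topic_word: (topics, vocab) word distribution per topic
    Assumes all predicted word probabilities are strictly positive.
    """
    word_probs = doc_topic @ topic_word  # predicted word distribution per document
    log_likelihood = (bow * np.log(word_probs)).sum()
    return float(np.exp(-log_likelihood / bow.sum()))


# Sanity check: a model that predicts every one of 4 words uniformly
# should have a perplexity of about 4.
bow = np.array([[1.0, 2.0, 0.0, 1.0]])
doc_topic = np.array([[0.5, 0.5]])
topic_word = np.full((2, 4), 0.25)
print(perplexity(bow, doc_topic, topic_word))
```

The same function can score either model family, as long as each one is reduced to a per-document distribution over words.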
Olivier Cruchant answered Mar 06 '23 08:03