 

Latent Dirichlet Allocation Solution Example

I am trying to learn about Latent Dirichlet Allocation (LDA). I have basic knowledge of machine learning and probability theory, and based on this blog post http://goo.gl/ccPvE I was able to develop an intuition for LDA. However, I still don't have a complete understanding of the various calculations that go into it. Can someone show me the calculations using a very small corpus (say, 3-5 sentences and 2-3 topics)?

asked May 16 '12 by user737128

People also ask

How do you implement Latent Dirichlet Allocation?

For the LDA model, we first need to build a dictionary of words where each word is given a unique id. Then we need to create a corpus in which each document is a list of (word_id, word_frequency) pairs. Finally, train the model. Coherence measures the relative distance between words within a topic.
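For instance, here is a minimal sketch of those steps in Python using gensim (the library choice and the toy documents are assumptions; the passage does not name either):

    # A minimal sketch of the steps above, assuming gensim.
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    docs = [
        ["broccoli", "banana", "eat"],
        ["banana", "spinach", "smoothie", "breakfast"],
        ["chinchilla", "kitten", "cute"],
    ]

    # 1. Dictionary: every word gets a unique integer id.
    dictionary = Dictionary(docs)

    # 2. Corpus: each document becomes a list of (word_id, word_frequency) pairs.
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    # 3. Train the model.
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
    print(lda.show_topics())

    # Coherence scores how well the words within each topic hang together.
    print(CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                         coherence="c_v").get_coherence())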

What is a good explanation of latent Dirichlet allocation?

Latent Dirichlet Allocation (LDA) is a popular form of statistical topic modeling. In LDA, documents are represented as a mixture of topics, and a topic is a bunch of words. Those topics reside within a hidden, also known as latent, layer.


1 Answer

Edwin Chen (who works at Twitter, btw) has an example on his blog, with 5 sentences and 2 topics:

  • I like to eat broccoli and bananas.
  • I ate a banana and spinach smoothie for breakfast.
  • Chinchillas and kittens are cute.
  • My sister adopted a kitten yesterday.
  • Look at this cute hamster munching on a piece of broccoli.

Then he does some "calculations":

  • Sentences 1 and 2: 100% Topic A
  • Sentences 3 and 4: 100% Topic B
  • Sentence 5: 60% Topic A, 40% Topic B

And takes guesses at the topics:

  • Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, …
    • at which point, you could interpret topic A to be about food
  • Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, …
    • at which point, you could interpret topic B to be about cute animals
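Those per-topic percentages are just normalized word-topic counts. Here is a toy illustration in Python (the counts are invented so that they reproduce Chen's numbers; in real LDA they would be estimated from the data):

    # Hypothetical word counts assigned to Topic A (invented for illustration).
    topic_a_counts = {"broccoli": 6, "bananas": 3, "breakfast": 2, "munching": 2,
                      "eat": 2, "smoothie": 2, "spinach": 2, "piece": 1}
    total = sum(topic_a_counts.values())  # 20
    for word, count in sorted(topic_a_counts.items(), key=lambda kv: -kv[1]):
        print(f"{word}: {count / total:.0%}")  # broccoli: 30%, bananas: 15%, ...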

Your question is: how did he come up with those numbers? First, notice which words in these sentences carry "information":

  • broccoli, bananas, smoothie, breakfast, munching, eat
  • chinchilla, kitten, cute, adopted, hamster

Now let's go sentence by sentence, counting words from each topic:

  • food 3, cute 0 --> food
  • food 5, cute 0 --> food
  • food 0, cute 3 --> cute
  • food 0, cute 2 --> cute
  • food 2, cute 2 --> 50% food + 50% cute

So my numbers differ slightly from Chen's; maybe he counts the word "piece" in "piece of broccoli" towards food.
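That counting can be done mechanically. A sketch in Python (note: "spinach" is added to the food list as an assumption, since it is needed to reach the count of 5 for sentence 2):

    # Count the "information-carrying" words per sentence and normalize
    # the counts into a topic mixture.
    food = {"eat", "ate", "broccoli", "bananas", "banana", "smoothie",
            "breakfast", "munching", "spinach"}  # spinach: an assumption
    cute = {"chinchillas", "kittens", "kitten", "cute", "adopted", "hamster"}

    sentences = [
        "I like to eat broccoli and bananas",
        "I ate a banana and spinach smoothie for breakfast",
        "Chinchillas and kittens are cute",
        "My sister adopted a kitten yesterday",
        "Look at this cute hamster munching on a piece of broccoli",
    ]

    for s in sentences:
        words = s.lower().split()
        f = sum(w in food for w in words)
        c = sum(w in cute for w in words)
        print(f"food {f}, cute {c} --> {f / (f + c):.0%} food, {c / (f + c):.0%} cute")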


We made two calculations in our heads:

  • to look at the sentences and come up with 2 topics in the first place. LDA does this by considering each sentence as a "mixture" of topics and guessing the parameters of each topic (a minimal sampler sketch follows below).
  • to decide which words are important. LDA handles this implicitly: words that appear in every document (like "I", "and", "a") end up spread evenly across topics and carry little topical weight, the same intuition behind "term-frequency/inverse-document-frequency" weighting.
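To make the first point concrete, here is a minimal collapsed Gibbs sampler, one standard way of fitting LDA's topic parameters (this is a sketch of the general algorithm, not Chen's exact procedure; alpha, beta, and iters are illustrative defaults):

    import random
    from collections import defaultdict

    def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=500):
        """Collapsed Gibbs sampling for LDA on tokenized docs."""
        vocab_size = len({w for doc in docs for w in doc})
        # z[d][i] = topic currently assigned to word i of document d.
        z = [[random.randrange(n_topics) for _ in doc] for doc in docs]
        doc_topic = [[0] * n_topics for _ in docs]                # doc -> topic counts
        topic_word = [defaultdict(int) for _ in range(n_topics)]  # topic -> word counts
        topic_total = [0] * n_topics
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1

        for _ in range(iters):
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    # Remove the word's current assignment from all counts...
                    t = z[d][i]
                    doc_topic[d][t] -= 1
                    topic_word[t][w] -= 1
                    topic_total[t] -= 1
                    # ...then resample: P(topic k) is proportional to
                    # (how much doc d uses k) * (how much k uses word w).
                    weights = [(doc_topic[d][k] + alpha) *
                               (topic_word[k][w] + beta) /
                               (topic_total[k] + beta * vocab_size)
                               for k in range(n_topics)]
                    t = random.choices(range(n_topics), weights=weights)[0]
                    z[d][i] = t
                    doc_topic[d][t] += 1
                    topic_word[t][w] += 1
                    topic_total[t] += 1
        # doc_topic rows give each sentence's topic mixture;
        # topic_word gives each topic's (unnormalized) word distribution.
        return doc_topic, topic_word

Running it on Chen's five sentences with n_topics=2 should roughly recover the food/cute split, up to sampling noise and the topics' labels being swapped.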
answered Sep 22 '22 by john mangual