I am trying to learn about Latent Dirichlet Allocation (LDA). I have basic knowledge of machine learning and probability theory, and based on this blog post http://goo.gl/ccPvE I was able to develop the intuition behind LDA. However, I still don't have a complete understanding of the various calculations that go on inside it. Can someone show me the calculations using a very small corpus (say, 3-5 sentences and 2-3 topics)?
For the LDA model, we first need to build a dictionary of words where each word is given a unique id. Then we need to create a corpus that maps each word id to its frequency in each document, i.e. (word_id, word_frequency) pairs. Finally, we train the model. Topic coherence measures how semantically similar the high-scoring words within a topic are.
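A minimal sketch of that pipeline, assuming the gensim library (the tokenized toy documents below are just placeholders, not part of the original answer):

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

# Toy corpus: each document is a list of already-tokenized words.
texts = [
    ["eat", "broccoli", "bananas"],
    ["banana", "spinach", "smoothie", "breakfast"],
    ["chinchillas", "kittens", "cute"],
    ["sister", "adopted", "kitten"],
    ["cute", "hamster", "munching", "broccoli"],
]

# 1. Dictionary: assigns a unique integer id to every word.
dictionary = Dictionary(texts)

# 2. Corpus: turns each document into (word_id, word_frequency) pairs.
corpus = [dictionary.doc2bow(text) for text in texts]

# 3. Train the LDA model.
lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=2, passes=50, random_state=0)

for topic_id, words in lda.print_topics():
    print(topic_id, words)

# Coherence: how semantically close each topic's top words are.
# (u_mass works directly off the corpus, which is convenient here.)
cm = CoherenceModel(model=lda, corpus=corpus,
                    dictionary=dictionary, coherence="u_mass")
print("coherence:", cm.get_coherence())
```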
Latent Dirichlet Allocation (LDA) is a popular form of statistical topic modeling. In LDA, documents are represented as a mixture of topics, and a topic is a distribution over words. Those topics reside in a hidden, also known as latent, layer: we never observe them directly, only the words they generate.
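Concretely, "a mixture of topics" means that the probability of seeing word $w$ in document $d$ decomposes over the $K$ hidden topics:

$$p(w \mid d) = \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)$$

where $p(z = k \mid d)$ is document $d$'s topic mixture and $p(w \mid z = k)$ is topic $k$'s word distribution.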
Edwin Chen (who works at Twitter, btw) has an example in his blog. 5 sentences, 2 topics:

- I like to eat broccoli and bananas.
- I ate a banana and spinach smoothie for breakfast.
- Chinchillas and kittens are cute.
- My sister adopted a kitten yesterday.
- Look at this cute hamster munching on a piece of broccoli.
Then he does some "calculations":

- Sentences 1 and 2: 100% Topic A
- Sentences 3 and 4: 100% Topic B
- Sentence 5: 60% Topic A, 40% Topic B
And he takes guesses at the topics:

- Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, ... (interpretable as being about food)
- Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, ... (interpretable as being about cute animals)
Your question is: how did he come up with those numbers? First, ask which words in these sentences carry "information":

- food (Topic A): eat, broccoli, bananas, ate, banana, spinach, smoothie, breakfast, munching
- cute animals (Topic B): chinchillas, kittens, cute, adopted, kitten, hamster

(Words like "like", "my", "sister", and "yesterday" don't point to either topic, so we ignore them.)
Now let's go sentence by sentence, counting the words from each topic:

- Sentence 1: eat, broccoli, bananas (3 food, 0 animal) → 100% Topic A
- Sentence 2: ate, banana, spinach, smoothie, breakfast (5 food, 0 animal) → 100% Topic A
- Sentence 3: chinchillas, kittens, cute (0 food, 3 animal) → 100% Topic B
- Sentence 4: adopted, kitten (0 food, 2 animal) → 100% Topic B
- Sentence 5: cute, hamster, munching, broccoli (2 food, 2 animal) → 50% Topic A, 50% Topic B
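A small Python sketch that reproduces this hand count (the two word lists are my own guesses from above, not the output of any model; punctuation is omitted from the sentences for simplicity):

```python
sentences = [
    "I like to eat broccoli and bananas",
    "I ate a banana and spinach smoothie for breakfast",
    "Chinchillas and kittens are cute",
    "My sister adopted a kitten yesterday",
    "Look at this cute hamster munching on a piece of broccoli",
]

# Hand-picked "information" words per topic (my guesses, as above).
food = {"eat", "ate", "broccoli", "bananas", "banana",
        "spinach", "smoothie", "breakfast", "munching"}
animals = {"chinchillas", "kittens", "kitten", "cute",
           "adopted", "hamster"}

for i, s in enumerate(sentences, 1):
    words = s.lower().split()
    n_food = sum(w in food for w in words)
    n_animal = sum(w in animals for w in words)
    total = n_food + n_animal
    print(f"Sentence {i}: {n_food/total:.0%} Topic A (food), "
          f"{n_animal/total:.0%} Topic B (cute animals)")
```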
So my numbers differ slightly from Chen's: I get 50/50 for sentence 5 where he gets 60/40. Maybe he counts the word "piece" in "a piece of broccoli" towards food as well, which would make 3 of the 5 information words food, i.e. 60%.
We made two calculations in our heads:

1. Which topic does each information-carrying word belong to? That is an estimate of p(word | topic).
2. For each sentence, what fraction of its information-carrying words belongs to each topic? That is an estimate of p(topic | document).

These are exactly the two distributions LDA learns; the algorithm just does the same bookkeeping probabilistically instead of by eyeballing.
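For reference, in collapsed Gibbs sampling, the learning procedure Chen's post sketches, those two intuitions combine into the standard textbook update for reassigning word $w_i$ of document $d$ to topic $k$ (this form comes from the LDA literature, not from the blog post itself):

$$p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; (n_{d,k}^{-i} + \alpha) \cdot \frac{n_{k,w_i}^{-i} + \beta}{\sum_{w} (n_{k,w}^{-i} + \beta)}$$

Here $n_{d,k}$ counts the words in document $d$ currently assigned to topic $k$ (our calculation 2), $n_{k,w}$ counts the assignments of word $w$ to topic $k$ across the corpus (our calculation 1), the superscript $-i$ means "excluding the word being resampled", and $\alpha$, $\beta$ are the Dirichlet smoothing priors.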