Probabilistic latent semantic analysis/Indexing - Introduction

Recently I found this link quite helpful for understanding the principles of LSA without too much math: http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html. It forms a good basis on which I can build further.

Currently, I'm looking for a similar introduction to Probabilistic Latent Semantic Analysis/Indexing: less math and more examples explaining the principles behind it. If you know of such an introduction, please let me know.

Can it be used to find the measure of similarity between sentences? Does it handle polysemy?

Is there a Python implementation of it?

Thank you.

Asked Jun 26 '11 by Sharmila


People also ask

What is the latent semantic indexing model?

Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.
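To see that idea in code, here is a minimal LSI sketch using scikit-learn's TruncatedSVD (a tooling choice assumed here, not one named above); the tiny corpus is made up for illustration:

```python
# Minimal LSI sketch: TF-IDF term-document matrix reduced by truncated SVD.
# Assumes scikit-learn is installed; the corpus is illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are common pets",
    "stock markets fell sharply today",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)              # term-document matrix (docs x terms)

svd = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = svd.fit_transform(X)            # documents in a 2-D "concept" space
print(doc_vecs)                            # similar docs land near each other
```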

What is the main difference between latent semantic analysis and Probabilistic Latent Semantic Analysis?

Compared to standard latent semantic analysis which stems from linear algebra and downsizes the occurrence tables (usually via a singular value decomposition), probabilistic latent semantic analysis is based on a mixture decomposition derived from a latent class model.

What is pLSA in machine learning?

pLSA stands for Probabilistic Latent Semantic Analysis. It uses a probabilistic method instead of the singular value decomposition that LSA uses to tackle the problem. The main goal is to find a probabilistic model with latent (hidden) topics that can generate the data we observe in our document-term matrix.
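To make the generative view concrete, here is a minimal NumPy sketch of the asymmetric pLSA model, P(d, w) = P(d) * sum_z P(z|d) * P(w|z), fitted with EM on a toy count matrix (this code is an illustration written for this page, not from any library):

```python
import numpy as np

def plsa(n_dw, n_topics, n_iter=100, seed=0):
    """EM for pLSA on a document-term count matrix n_dw (docs x words).

    Model: P(d, w) = P(d) * sum_z P(z|d) * P(w|z).
    Returns P(z|d) (docs x topics) and P(w|z) (topics x words).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = n_dw.shape
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w), shape (docs, topics, words)
        resp = p_z_d[:, :, None] * p_w_z[None, :, :]
        resp /= resp.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate the parameters from expected counts n(d, z, w)
        expected = n_dw[:, None, :] * resp
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z

# Toy usage: two documents over a three-word vocabulary.
p_z_d, p_w_z = plsa(np.array([[2., 1., 0.], [0., 1., 3.]]), n_topics=2)
```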

What is meant by latent and semantic analysis?

Latent semantic analysis (LSA) is a mathematical method for computer modeling and simulation of the meaning of words and passages by analysis of representative corpora of natural text. LSA closely approximates many aspects of human language learning and understanding.


1 Answer

There is a good talk by Thomas Hofmann that explains both LSA and its connections to Probabilistic Latent Semantic Analysis (PLSA). The talk has some math, but is much easier to follow than the PLSA paper (or even its Wikipedia page).

PLSA can be used to get a similarity measure between sentences, as two sentences can be viewed as short documents drawn from a probability distribution over latent classes. Your similarity will heavily depend on your training set, though. The documents you use to train the latent class model should reflect the types of documents you want to compare. Generating a PLSA model with only two sentences won't create meaningful latent classes. Similarly, training with a corpus of very similar contexts may create latent classes that are overly sensitive to slight changes in the documents. Moreover, because sentences contain relatively few tokens (compared to documents), I don't believe you'll get high-quality similarity results from PLSA at the sentence level.
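As a sketch of what that similarity could look like (assuming topic mixtures P(z|d) from a fitted model, e.g. the EM sketch earlier on this page), two training documents can be compared by the cosine of their topic mixtures:

```python
import numpy as np

def topic_cosine(p_z_d, i, j):
    """Cosine similarity between the topic mixtures of training docs i and j."""
    a, b = p_z_d[i], p_z_d[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

Note that scoring unseen sentences requires folding them into the trained model first, which is exactly where short, sparse sentences hurt.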

PLSA does not handle polysemy. However, if you are concerned with polysemy, you might try running a Word Sense Disambiguation tool over your input text to tag each word with its correct sense. Running PLSA (or LDA) over this tagged corpus will remove the effects of polysemy in the resulting document representations.
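Here is a hedged sketch of that preprocessing step, using NLTK's simplified Lesk disambiguator (a tool choice assumed here, not one named above): each word is replaced by a WordNet sense label, so different senses of "bank" become distinct tokens for the topic model.

```python
# Requires nltk plus its 'punkt' and 'wordnet' data downloads.
from nltk import word_tokenize
from nltk.wsd import lesk

def sense_tag(sentence):
    """Replace each token with a WordNet sense name where Lesk finds one."""
    tokens = word_tokenize(sentence.lower())
    tagged = []
    for tok in tokens:
        synset = lesk(tokens, tok)       # may be None for unknown words
        tagged.append(synset.name() if synset else tok)
    return tagged

# Content words come out as sense labels such as 'bank.n.01' (the exact
# senses depend on Lesk's guess, which is known to be noisy).
print(sense_tag("I deposited the cash at the bank"))
```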

As Sharmila noted, Latent Dirichlet allocation (LDA) is considered the state of the art for document comparison, and is superior to PLSA, which tends to overfit the training data. In addition, there are many more tools to support LDA and analyze whether the results you get with LDA are meaningful. (If you're feeling adventurous, you can read David Mimno's two papers from EMNLP 2011 on how to assess the quality of the latent topics you get from LDA.)
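For completeness, here is a minimal LDA sketch with gensim (a library choice assumed here; the corpus is illustrative only): train on tokenized documents and compare per-document topic distributions with Hellinger distance.

```python
from gensim import corpora, models, matutils

docs = [
    ["human", "computer", "interface", "user"],
    ["graph", "trees", "minors", "survey"],
    ["user", "interface", "response", "time"],
]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

lda = models.LdaModel(bow, id2word=dictionary, num_topics=2,
                      passes=20, random_state=0)

# Per-document topic distributions; lower Hellinger distance = more similar.
vec0 = lda.get_document_topics(bow[0], minimum_probability=0.0)
vec1 = lda.get_document_topics(bow[1], minimum_probability=0.0)
print(matutils.hellinger(vec0, vec1))
```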

Answered Sep 27 '22 by David Jurgens