Probabilistic latent semantic analysis/Indexing - Introduction

Recently I found this link quite helpful for understanding the principles of LSA without too much math: http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html. It forms a good basis on which I can build further.

Currently, I'm looking for a similar introduction to Probabilistic Latent Semantic Analysis/Indexing: less math and more examples explaining the principles behind it. If you know of such an introduction, please let me know.

Can it be used to find the measure of similarity between sentences? Does it handle polysemy?

Is there a Python implementation of it?

Thank you.

Asked Jun 26 '11 by Sharmila


People also ask

What is the latent semantic indexing model?

Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.
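To see that idea in code, here is a minimal LSI sketch using scikit-learn's TruncatedSVD (a tooling choice assumed here, not one named above); the tiny corpus is made up for illustration:

```python
# Minimal LSI sketch: TF-IDF term-document matrix reduced by truncated SVD.
# Assumes scikit-learn is installed; the corpus is illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are common pets",
    "stock markets fell sharply today",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)              # term-document matrix (docs x terms)

svd = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = svd.fit_transform(X)            # documents in a 2-D "concept" space
print(doc_vecs)                            # similar docs land near each other
```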

What is the main difference between latent semantic analysis and Probabilistic Latent Semantic Analysis?

Compared to standard latent semantic analysis which stems from linear algebra and downsizes the occurrence tables (usually via a singular value decomposition), probabilistic latent semantic analysis is based on a mixture decomposition derived from a latent class model.

What is pLSA in machine learning?

pLSA stands for Probabilistic Latent Semantic Analysis. It uses a probabilistic method instead of the singular value decomposition that LSA uses to tackle the problem. The main goal is to find a probabilistic model with latent (hidden) topics that can generate the data we observe in our document-term matrix.
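To make the generative view concrete, here is a minimal NumPy sketch of the asymmetric pLSA model, P(d, w) = P(d) * sum_z P(z|d) * P(w|z), fitted with EM on a toy count matrix (this code is an illustration written for this page, not from any library):

```python
import numpy as np

def plsa(n_dw, n_topics, n_iter=100, seed=0):
    """EM for pLSA on a document-term count matrix n_dw (docs x words).

    Model: P(d, w) = P(d) * sum_z P(z|d) * P(w|z).
    Returns P(z|d) (docs x topics) and P(w|z) (topics x words).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = n_dw.shape
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w), shape (docs, topics, words)
        resp = p_z_d[:, :, None] * p_w_z[None, :, :]
        resp /= resp.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate the parameters from expected counts n(d, z, w)
        expected = n_dw[:, None, :] * resp
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z

# Toy usage: two documents over a three-word vocabulary.
p_z_d, p_w_z = plsa(np.array([[2., 1., 0.], [0., 1., 3.]]), n_topics=2)
```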

What is meant by latent and semantic analysis?

Latent semantic analysis (LSA) is a mathematical method for computer modeling and simulation of the meaning of words and passages by analysis of representative corpora of natural text. LSA closely approximates many aspects of human language learning and understanding.


1 Answer

There is a good talk by Thomas Hofmann that explains both LSA and its connections to Probabilistic Latent Semantic Analysis (PLSA). The talk has some math, but is much easier to follow than the PLSA paper (or even its Wikipedia page).

PLSA can be used to get a similarity measure between sentences, as two sentences can be viewed as short documents drawn from a probability distribution over latent classes. Your similarity will heavily depend on your training set, though. The documents you use to train the latent class model should reflect the types of documents you want to compare. Generating a PLSA model with only two sentences won't create meaningful latent classes. Similarly, training with a corpus of very similar contexts may create latent classes that are overly sensitive to slight changes in the documents. Moreover, because sentences contain relatively few tokens (compared to documents), I don't believe you'll get high-quality similarity results from PLSA at the sentence level.
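As a sketch of what that similarity could look like (assuming topic mixtures P(z|d) from a fitted model, e.g. the EM sketch earlier on this page), two training documents can be compared by the cosine of their topic mixtures:

```python
import numpy as np

def topic_cosine(p_z_d, i, j):
    """Cosine similarity between the topic mixtures of training docs i and j."""
    a, b = p_z_d[i], p_z_d[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

Note that scoring unseen sentences requires folding them into the trained model first, which is exactly where short, sparse sentences hurt.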

PLSA does not handle polysemy. However, if you are concerned with polysemy, you might try running a Word Sense Disambiguation tool over your input text to tag each word with its correct sense. Running PLSA (or LDA) over this tagged corpus will remove the effects of polysemy in the resulting document representations.
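Here is a hedged sketch of that preprocessing step, using NLTK's simplified Lesk disambiguator (a tool choice assumed here, not one named above): each word is replaced by a WordNet sense label, so different senses of "bank" become distinct tokens for the topic model.

```python
# Requires nltk plus its 'punkt' and 'wordnet' data downloads.
from nltk import word_tokenize
from nltk.wsd import lesk

def sense_tag(sentence):
    """Replace each token with a WordNet sense name where Lesk finds one."""
    tokens = word_tokenize(sentence.lower())
    tagged = []
    for tok in tokens:
        synset = lesk(tokens, tok)       # may be None for unknown words
        tagged.append(synset.name() if synset else tok)
    return tagged

# Content words come out as sense labels such as 'bank.n.01' (the exact
# senses depend on Lesk's guess, which is known to be noisy).
print(sense_tag("I deposited the cash at the bank"))
```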

As Sharmila noted, Latent Dirichlet allocation (LDA) is considered the state of the art for document comparison, and is superior to PLSA, which tends to overfit the training data. In addition, there are many more tools to support LDA and analyze whether the results you get with LDA are meaningful. (If you're feeling adventurous, you can read David Mimno's two papers from EMNLP 2011 on how to assess the quality of the latent topics you get from LDA.)
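For completeness, here is a minimal LDA sketch with gensim (a library choice assumed here; the corpus is illustrative only): train on tokenized documents and compare per-document topic distributions with Hellinger distance.

```python
from gensim import corpora, models, matutils

docs = [
    ["human", "computer", "interface", "user"],
    ["graph", "trees", "minors", "survey"],
    ["user", "interface", "response", "time"],
]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

lda = models.LdaModel(bow, id2word=dictionary, num_topics=2,
                      passes=20, random_state=0)

# Per-document topic distributions; lower Hellinger distance = more similar.
vec0 = lda.get_document_topics(bow[0], minimum_probability=0.0)
vec1 = lda.get_document_topics(bow[1], minimum_probability=0.0)
print(matutils.hellinger(vec0, vec1))
```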

Answered Sep 27 '22 by David Jurgens