I've read about using Singular Value Decomposition (SVD) to do Latent Semantic Analysis (LSA) in corpus of texts. I've understood how to do that, also I understand mathematical concepts of SVD.
But I don't understand why does it works applying to corpuses of texts (I believe - there must be linguistical explanation). Could anybody explain me this with linguistic point of view?
Thanks
LSA deals with the following kind of issue: Example: mobile, phone, cell phone, telephone are all similar but if we pose a query like “The cell phone has been ringing” then the documents which have “cell phone” are only retrieved whereas the documents containing the mobile, phone, telephone are not retrieved.
Latent semantic analysis (LSA) is a mathematical method for computer modeling and simulation of the meaning of words and passages by analysis of representative corpora of natural text. LSA closely approximates many aspects of human language learning and understanding.
Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.
Latent semantic indexing (also referred to as Latent Semantic Analysis) is a method of analyzing a set of documents in order to discover statistical co-occurrences of words that appear together which then give insights into the topics of those words and documents.
There is no linguistic interpretation, there is no syntax involved, no handling of equivalence classes, synonyms, homonyms, stemming etc. Neither are any semantics involved, it is just words-occuring-together. Consider a "document" as a shopping cart: it contains a combination of words (purchases). And words tend to occur together with "related" words.
For instance: The word "drug" can occur together with either of {love, doctor, medicine, sports, crime}; each will point you in a different direction. But combined with many other words in the document, your query will probably find documents from a similar field.
Words occurring together (i.e. nearby or in the same document in a corpus) contribute to context. Latent Semantic Analysis basically groups similar documents in a corpus based on how similar they are to each other in terms of context.
I think the example and the word-document plot on this page will help in understanding.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With