
Need help with latent semantic indexing

I am sorry if my question sounds stupid :) Can you please recommend any pseudocode or a good algorithm for an LSI implementation in Java? I am not a math expert. I tried to read some articles about LSI (latent semantic indexing) on Wikipedia and other websites, but they were full of math. I know LSI is full of math, but I understand things more easily if I can see some source code or an algorithm. That's why I asked here, because so many gurus are here! Thanks in advance

user238384 asked Jan 07 '10

People also ask

Which option would you use for latent semantic indexing?

Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.

What is the idea behind latent semantic indexing, what is its goal, and how does it work?

Latent Semantic Indexing, also known as latent semantic analysis, is a mathematical practice that helps classify and retrieve information on particular key terms and concepts using singular value decomposition (SVD).

Does Google use latent semantic indexing?

“Google relies on LSI keywords to understand content at such a deep level.” “LSI Keywords are NOT synonyms. Instead, they're terms that are closely tied to your target keyword.” “Google doesn't ONLY bold terms that exactly match what you just searched for (in search results).”

What are latent semantic indexing keywords?

LSI (Latent Semantic Indexing) keywords are words that are related to a main keyword and are seen as semantically relevant. If your page's primary keyword is 'credit cards,' then LSI keywords would be things like “money,” “credit score,” “credit limit,” or “interest rate.”


1 Answer

The idea of LSA is based on one assumption: the more often two words occur in the same documents, the more similar they are. Indeed, we can expect that the words "programming" and "algorithm" will occur in the same documents much more often than, say, "programming" and "dog-breeding".

The same holds for documents: the more common/similar words two documents share, the more similar the documents themselves are. So you can express the similarity of documents through word frequencies, and vice versa.

Knowing this, we can construct a co-occurrence matrix where columns represent documents, rows represent words, and each cell cells[i][j] holds the frequency of word words[i] in document documents[j]. Frequency may be computed in many ways; IIRC, the original LSA uses the tf-idf index.
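Since the question asks for Java, here is a minimal, self-contained sketch of this step. The toy corpus, the class name and the plain log-idf weighting are illustrative assumptions of mine, not something prescribed by LSA:

    import java.util.*;
    import java.util.stream.Collectors;

    // Minimal sketch: build a tf-idf weighted term-document matrix.
    public class TermDocumentMatrix {
        public static void main(String[] args) {
            List<String[]> docs = Arrays.asList(
                new String[]{"programming", "algorithm", "java"},
                new String[]{"programming", "java", "matrix"},
                new String[]{"dog", "breeding", "puppy"});

            // Step 1: collect the vocabulary (all unique words, sorted).
            List<String> words = new ArrayList<>(new TreeSet<>(docs.stream()
                .flatMap(Arrays::stream).collect(Collectors.toList())));

            int n = words.size(), m = docs.size();
            double[][] cells = new double[n][m];

            // Step 2a: raw term frequencies.
            for (int j = 0; j < m; j++)
                for (String w : docs.get(j))
                    cells[words.indexOf(w)][j]++;

            // Step 2b: tf-idf weighting: tf * log(totalDocs / docsContainingWord).
            for (int i = 0; i < n; i++) {
                int df = 0;
                for (int j = 0; j < m; j++) if (cells[i][j] > 0) df++;
                double idf = Math.log((double) m / df);
                for (int j = 0; j < m; j++) cells[i][j] *= idf;
            }

            for (int i = 0; i < n; i++)
                System.out.println(words.get(i) + " -> " + Arrays.toString(cells[i]));
        }
    }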

Having such a matrix, you can find the similarity of two documents by comparing the corresponding columns. How do you compare them? Again, there are several ways; the most popular is cosine distance. You may remember from school maths that a matrix can be treated as a bunch of vectors, so each column is just a vector in some multidimensional space. That's why this model is called the "Vector Space Model". More on VSM and cosine distance here.
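A sketch of the column comparison, reusing the cells layout from above (cosine similarity is just 1 minus cosine distance; the method name is my own choice):

    // Minimal sketch: cosine similarity between two document columns.
    static double cosine(double[][] cells, int docA, int docB) {
        double dot = 0, normA = 0, normB = 0;
        for (double[] row : cells) {          // one row per word
            dot   += row[docA] * row[docB];
            normA += row[docA] * row[docA];
            normB += row[docB] * row[docB];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }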

But we have one problem with such a matrix: it is big. Very, very big. Working with it directly is too computationally expensive, so we have to reduce it somehow. LSA uses the SVD technique to keep only the most "important" vectors. After reduction, the matrix is ready to use.
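As a sketch of the reduction step, here is one way to do it with Apache Commons Math's SingularValueDecomposition. Using that library is my substitution for illustration; the libraries suggested further below (Parallel Colt, UJMP) offer similar decompositions:

    import java.util.Arrays;
    import org.apache.commons.math3.linear.*;

    // Minimal sketch: rank-k reduction of the term-document matrix via SVD.
    public class LsaReduction {

        // Returns the documents as rows in the k-dimensional latent space (V_k * S_k).
        static RealMatrix reduceDocuments(double[][] cells, int k) {
            RealMatrix a = MatrixUtils.createRealMatrix(cells);    // words x docs
            SingularValueDecomposition svd = new SingularValueDecomposition(a);

            // Keep only the k largest singular values and their vectors.
            RealMatrix vk = svd.getV().getSubMatrix(
                0, a.getColumnDimension() - 1, 0, k - 1);           // docs x k
            RealMatrix sk = MatrixUtils.createRealDiagonalMatrix(
                Arrays.copyOf(svd.getSingularValues(), k));         // k x k

            return vk.multiply(sk);  // row = one document in the reduced space
        }
    }

Rows of the result can then be compared with the same cosine function as before, only now in k dimensions instead of one dimension per word.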

So, the algorithm for LSA will look something like this:

  1. Collect all documents and all unique words from them.
  2. Extract frequency information and build the co-occurrence matrix.
  3. Reduce the matrix with SVD.

If you're going to write an LSA library yourself, a good place to start is the Lucene search engine, which will make steps 1 and 2 much easier, plus some implementation of high-dimensional matrices with SVD capability, like Parallel Colt or UJMP.
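For step 1, Lucene's analyzers handle the tokenization. A minimal sketch, assuming a reasonably recent Lucene (5.x or later, where StandardAnalyzer takes no constructor arguments); the field name and sample text are mine:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // Minimal sketch: extract normalized terms from a document with Lucene.
    public class Tokenize {
        public static void main(String[] args) throws Exception {
            try (Analyzer analyzer = new StandardAnalyzer();
                 TokenStream ts = analyzer.tokenStream("body",
                         new StringReader("Programming and algorithms, not dog-breeding."))) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(term.toString());  // one lower-cased term per token
                }
                ts.end();
            }
        }
    }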

Also pay attention to other techniques that grew out of LSA, like Random Indexing. RI uses the same idea and shows approximately the same results, but it doesn't need the full-matrix stage and is completely incremental, which makes it much more computationally efficient.
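To make the incremental idea concrete, here is a minimal sketch of Random Indexing; the dimensionality, sparsity and per-document seeding scheme are illustrative choices of mine:

    import java.util.*;

    // Minimal sketch of Random Indexing: each document gets a fixed sparse
    // random "index vector"; a word's context vector is the running sum of
    // the index vectors of the documents it appears in. No full matrix is
    // ever built, and new documents can be folded in at any time.
    public class RandomIndexing {
        static final int DIM = 300, NON_ZEROS = 8;  // illustrative values
        final Map<String, double[]> wordVectors = new HashMap<>();

        // Sparse ternary index vector, deterministic per document id.
        double[] indexVector(long docId) {
            double[] v = new double[DIM];
            Random rnd = new Random(docId);
            for (int i = 0; i < NON_ZEROS; i++)
                v[rnd.nextInt(DIM)] = rnd.nextBoolean() ? 1 : -1;
            return v;
        }

        // Process one document; call again whenever a new document arrives.
        void addDocument(long docId, String[] words) {
            double[] docVec = indexVector(docId);
            for (String w : words) {
                double[] ctx = wordVectors.computeIfAbsent(w, k -> new double[DIM]);
                for (int i = 0; i < DIM; i++) ctx[i] += docVec[i];
            }
        }
    }

Word similarity then falls out of cosine similarity between context vectors, just as in the matrix version.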

ffriend answered Sep 23 '22