Need help in latent semantic indexing

Tags: java, algorithm, math, latent-semantic-indexing

Sorry if my question sounds stupid :) Can you recommend any pseudocode or a good algorithm for implementing LSI in Java? I'm not a math expert. I tried reading some articles on Wikipedia and other websites about LSI (latent semantic indexing), but they were full of math. I know LSI is full of math, but I understand things more easily when I see some source code or an algorithm. That's why I'm asking here, since there are so many gurus here! Thanks in advance.

639

asked Jan 07 '10 02:01

user238384

1 Answer

The idea behind LSA rests on one assumption: the more often two words occur in the same documents, the more similar they are. Indeed, we can expect the words "programming" and "algorithm" to occur in the same documents much more often than, say, "programming" and "dog-breeding".

The same goes for documents: the more words two documents have in common, the more similar the documents are. So you can express the similarity of documents through the frequencies of words, and vice versa.

Knowing this, we can construct a term-document matrix (often loosely called a co-occurrence matrix), where the columns represent documents, the rows represent words, and each cell cells[i][j] holds the frequency of word words[i] in document documents[j]. The frequency may be computed in many ways; IIRC, original LSA uses the tf-idf index.
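To make the matrix concrete, here is a minimal Java sketch. The class name `TermDocumentMatrix`, the naive regex tokenizer, and the particular tf-idf variant (raw count times ln(N / df)) are my own assumptions for illustration, not part of any library; a real implementation would probably use a proper analyzer and a sparse matrix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: rows are words, columns are documents.
final class TermDocumentMatrix {
    final List<String> words = new ArrayList<>(); // row labels, in first-seen order
    final double[][] cells;                       // cells[i][j] = count of words[i] in document j
    final int documentCount;

    TermDocumentMatrix(List<String> documents) {
        documentCount = documents.size();
        Map<String, Integer> rowOf = new LinkedHashMap<>();
        List<String[]> tokenized = new ArrayList<>();
        // Pass 1: collect all unique words (naive tokenizer, an assumption).
        for (String doc : documents) {
            String[] tokens = doc.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
            tokenized.add(tokens);
            for (String t : tokens) {
                if (!t.isEmpty() && !rowOf.containsKey(t)) {
                    rowOf.put(t, words.size());
                    words.add(t);
                }
            }
        }
        // Pass 2: count how often each word occurs in each document.
        cells = new double[words.size()][documentCount];
        for (int j = 0; j < tokenized.size(); j++) {
            for (String t : tokenized.get(j)) {
                if (!t.isEmpty()) {
                    cells[rowOf.get(t)][j] += 1.0;
                }
            }
        }
    }

    // One common tf-idf variant (an assumption): count * ln(N / df),
    // where df is the number of documents containing the word.
    double[][] tfIdf() {
        double[][] weighted = new double[cells.length][documentCount];
        for (int i = 0; i < cells.length; i++) {
            int df = 0;
            for (double c : cells[i]) {
                if (c > 0) df++;
            }
            double idf = Math.log((double) documentCount / df);
            for (int j = 0; j < documentCount; j++) {
                weighted[i][j] = cells[i][j] * idf;
            }
        }
        return weighted;
    }

    public static void main(String[] args) {
        TermDocumentMatrix m = new TermDocumentMatrix(Arrays.asList(
                "programming algorithm programming", "dog breeding", "algorithm design"));
        for (int i = 0; i < m.words.size(); i++) {
            System.out.println(m.words.get(i) + " " + Arrays.toString(m.cells[i]));
        }
    }
}
```

Note that with this weighting a word that appears in every document gets weight 0, which is the intended effect: such words say nothing about how documents differ.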

Given such a matrix, you can find the similarity of two documents by comparing the corresponding columns. How to compare them? Again, there are several ways. The most popular is cosine similarity (often called cosine distance). You may remember from school maths that a matrix can be treated as a bunch of vectors, so each column is just a vector in some multidimensional space. That's why this model is called the "Vector Space Model". More on VSM and cosine distance here.
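Comparing two columns with the cosine measure takes only a few lines. This is a sketch; the class `Cosine` and its method names are hypothetical, and returning 0 for an all-zero vector is my own convention, since the cosine is undefined there.

```java
// Hypothetical helper for comparing document columns.
final class Cosine {
    // Extracts column j (one document) from a words-by-documents matrix.
    static double[] column(double[][] matrix, int j) {
        double[] col = new double[matrix.length];
        for (int i = 0; i < matrix.length; i++) {
            col[i] = matrix[i][j];
        }
        return col;
    }

    // cos(a, b) = (a . b) / (|a| * |b|); 1 means same direction, 0 means nothing in common.
    static double similarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // convention chosen here for empty documents
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[][] cells = { {2, 0, 0}, {1, 0, 1}, {0, 1, 0} };
        System.out.println(similarity(column(cells, 0), column(cells, 2)));
    }
}
```

Because the cosine only looks at the angle between vectors, a long document and a short one with the same word proportions come out as identical, which is usually what you want for text.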

But we have one problem with such a matrix: it is big. Very, very big. Working with it is too computationally expensive, so we have to reduce it somehow. LSA uses singular value decomposition (SVD) and keeps only the most "important" dimensions, that is, those with the largest singular values. After the reduction, the matrix is ready to use.
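To show what the reduction does, here is a hand-rolled sketch that finds the top k singular directions by power iteration with deflation. This is a swapped-in teaching technique, not how you would do it in production: a real project should call the SVD routine of a matrix library. The class `TruncatedSvd`, the fixed seed, and the tolerance are all assumptions of mine.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch: rank-k reduction via power iteration (not a library API).
final class TruncatedSvd {
    // Returns a k-by-documents matrix; column j holds document j's coordinates
    // in the reduced space (sigma_c * v_c[j] for each kept component c).
    static double[][] reduceDocuments(double[][] a, int k, int iterations) {
        int rows = a.length, cols = a[0].length;
        double[][] r = new double[rows][];            // residual, starts as a copy of a
        for (int i = 0; i < rows; i++) r[i] = a[i].clone();
        double[][] reduced = new double[k][cols];
        Random rnd = new Random(42);                  // fixed seed for repeatable output
        for (int c = 0; c < k; c++) {
            double[] v = new double[cols];
            for (int j = 0; j < cols; j++) v[j] = rnd.nextDouble() + 0.1;
            normalize(v);
            // Repeatedly applying R^T R pulls v toward the top right singular vector.
            for (int it = 0; it < iterations; it++) {
                double[] next = timesTranspose(r, times(r, v));
                if (normalize(next) == 0.0) break;
                v = next;
            }
            double[] u = times(r, v);
            double sigma = normalize(u);              // singular value of this component
            if (sigma < 1e-12) break;                 // matrix has rank < k; leave zeros
            for (int j = 0; j < cols; j++) reduced[c][j] = sigma * v[j];
            // Deflate: remove the component just found so the next one can surface.
            for (int i = 0; i < rows; i++)
                for (int j = 0; j < cols; j++)
                    r[i][j] -= sigma * u[i] * v[j];
        }
        return reduced;
    }

    static double[] times(double[][] m, double[] v) {          // m * v
        double[] out = new double[m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < v.length; j++) out[i] += m[i][j] * v[j];
        return out;
    }

    static double[] timesTranspose(double[][] m, double[] u) { // m^T * u
        double[] out = new double[m[0].length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < out.length; j++) out[j] += m[i][j] * u[i];
        return out;
    }

    // Scales x to unit length in place and returns its original length.
    static double normalize(double[] x) {
        double sum = 0.0;
        for (double d : x) sum += d * d;
        double norm = Math.sqrt(sum);
        if (norm > 0.0) for (int i = 0; i < x.length; i++) x[i] /= norm;
        return norm;
    }

    public static void main(String[] args) {
        double[][] cells = { {2, 0, 0}, {1, 0, 1}, {0, 1, 0}, {0, 1, 0}, {0, 0, 1} };
        System.out.println(Arrays.deepToString(reduceDocuments(cells, 2, 100)));
    }
}
```

After the reduction each document is a short vector of length k instead of one entry per word, and documents can be compared with the same cosine measure as before. The signs of the coordinates are arbitrary, which does not affect cosine comparisons between documents.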

So, the algorithm for LSA will look something like this:

1. Collect all documents and all unique words from them.
2. Extract frequency information and build the term-document (co-occurrence) matrix.
3. Reduce the matrix with SVD.

If you're going to write an LSA library yourself, a good starting point is the Lucene search engine, which makes steps 1 and 2 much easier, together with some implementation of high-dimensional matrices with SVD capability, such as Parallel Colt or UJMP.

Also pay attention to other techniques that grew out of LSA, like Random Indexing. RI uses the same idea and shows approximately the same results, but it skips the full-matrix stage and is completely incremental, which makes it much more computationally efficient.
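For comparison, here is a rough sketch of the Random Indexing idea in its document-based form: every document gets a sparse random "index vector" of small fixed size, and each word's vector is simply the sum of the index vectors of the documents it occurs in. The class `RandomIndexing`, the tokenizer, and the parameter choices are my own assumptions; real implementations tune the dimensionality and sparsity.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of document-based Random Indexing (not a library API).
final class RandomIndexing {
    private final int dimensions;   // fixed size of every vector, far smaller than the vocabulary
    private final int nonZeros;     // how many +1/-1 entries to draw per index vector
    private final Random rnd;
    final Map<String, double[]> contextVectors = new HashMap<>();

    RandomIndexing(int dimensions, int nonZeros, long seed) {
        this.dimensions = dimensions;
        this.nonZeros = nonZeros;
        this.rnd = new Random(seed);
    }

    // Incremental: a new document only touches the vectors of the words it contains.
    void addDocument(String text) {
        double[] index = newIndexVector();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue;
            double[] ctx = contextVectors.computeIfAbsent(token, t -> new double[dimensions]);
            for (int d = 0; d < dimensions; d++) {
                ctx[d] += index[d];
            }
        }
    }

    // Sparse ternary vector: mostly zeros, with up to nonZeros entries set to +1 or -1
    // (random positions may collide, so there can be slightly fewer).
    private double[] newIndexVector() {
        double[] v = new double[dimensions];
        for (int n = 0; n < nonZeros; n++) {
            v[rnd.nextInt(dimensions)] = rnd.nextBoolean() ? 1.0 : -1.0;
        }
        return v;
    }

    public static void main(String[] args) {
        RandomIndexing ri = new RandomIndexing(64, 4, 1L);
        ri.addDocument("programming algorithm");
        ri.addDocument("dog breeding");
        System.out.println(ri.contextVectors.keySet());
    }
}
```

Words that tend to share documents end up with similar vectors, so they can be compared with the cosine measure, and no big matrix or SVD step is ever needed.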

114

answered Sep 23 '22 01:09

ffriend
