LSA - Latent Semantic Analysis - How to code it in PHP?

1 Answer

LSA links:

Landauer (co-creator) article on LSA
the R-project lsa user guide

Here is the complete algorithm. If you have SVD, you are most of the way there. The papers above explain it better than I do.

Assumptions:

Your SVD function returns the singular values in descending order, with the singular vectors ordered to match. If not, you have to sort them yourself first.

M: corpus matrix, w (words) by d (documents) (w rows, d columns). The entries can be raw counts, tf-idf weights, or whatever. Stopwords may or may not be eliminated, and stemming may happen (Landauer says keep stopwords and don't stem, but yes to tf-idf).
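
A minimal sketch of building M, assuming whitespace tokenization and the plain log(N / df) form of idf (one common variant among several). The function name build_term_document_matrix and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def build_term_document_matrix(docs, use_tfidf=True):
    """Return (vocab, M): M has one row per word and one column per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    n_docs = len(docs)
    M = []
    for word in vocab:
        row = [float(c[word]) for c in counts]        # raw counts
        if use_tfidf:
            df = sum(1 for c in counts if word in c)  # documents containing the word
            idf = math.log(n_docs / df)               # 0 for a word found in every document
            row = [tf * idf for tf in row]
        M.append(row)
    return vocab, M

docs = [
    "the yankees play in the bronx",
    "the mets play in queens",
    "the bronx is north of manhattan",
]
vocab, M = build_term_document_matrix(docs, use_tfidf=False)
print(len(vocab), "words x", len(M[0]), "documents")
```

With this idf variant, a word that appears in every document gets weight 0, so it drops out without an explicit stopword list.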

U,Sigma,V = singular_value_decomposition(M)

U:  w x w matrix
Sigma:  vector of length min(w,d), or a w x d matrix with the singular values in the first min(w,d) diagonal entries
V:  d x d matrix

Thus U * Sigma * V = M, with Sigma in its matrix form.
#  In the usual textbook convention the product is U * Sigma * t(V)
#  (t() = transpose), so you might have to do some transposes depending
#  on how your SVD code returns U and V.  Verify this so that you don't go crazy :)
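
One way to do that verification is to multiply the factors back together and compare with M. This is a sketch in plain Python lists, assuming the right-hand factor is passed already transposed (called Vt here, k x d). The helper names and the small hand-built 2 x 2 factors are illustrative only, not the output of any particular SVD library.

```python
def reconstruct(U, sigma, Vt):
    """Compute U * diag(sigma) * Vt, using only the first len(sigma) columns of U."""
    k = len(sigma)
    return [[sum(U[i][r] * sigma[r] * Vt[r][j] for r in range(k))
             for j in range(len(Vt[0]))]
            for i in range(len(U))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def max_abs_error(A, B):
    """Largest elementwise difference between two equally sized matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Hand-built factors: two rotation matrices and two singular values.
U = [[0.6, -0.8], [0.8, 0.6]]
sigma = [5.0, 2.5]
Vt = [[0.8, 0.6], [-0.6, 0.8]]
M = [[3.6, 0.2], [2.3, 3.6]]

print("right orientation:", max_abs_error(reconstruct(U, sigma, Vt), M))
print("wrong orientation:", max_abs_error(reconstruct(U, sigma, transpose(Vt)), M))
```

The first error should be at the level of floating point noise and the second clearly not, which tells you which orientation your SVD routine hands back.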

Then the dimensionality reduction. The actual LSA paper suggests that a good approximation for the basis is to keep enough vectors that their singular values add up to more than 50% of the total of the singular values.

More succinctly... (pseudocode)

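def choose_rank(Sigma):
    # Sigma: singular values, in descending order
    s1 = sum(Sigma)
    total = 0
    for ii in range(1, len(Sigma) + 1):
        total += Sigma[ii - 1]   # running sum of the ii largest values
        if total > .5 * s1:
            return ii            # number of singular values to keep
    return len(Sigma)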

This will return the rank of the new basis, which was min(d,w) before, and we'll now approximate with {ii}.

(here, ' -> prime, not transpose)

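We create new matrices U', Sigma', V', with sizes w x ii, ii x ii, and ii x d, by keeping the first ii columns of U, the first ii singular values (on the diagonal of Sigma'), and the first ii rows of V (in the orientation where U * Sigma * V = M).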
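
A sketch of that truncation on plain Python lists, assuming the singular values come as a vector and the right-hand factor is oriented so that its rows pair with the singular values (called Vt here). The function name truncate_svd and the toy factors are hypothetical.

```python
def truncate_svd(U, sigma, Vt, k):
    """Keep the k leading singular triplets."""
    U_k = [list(row[:k]) for row in U]     # w x k: first k columns of U
    sigma_k = list(sigma[:k])              # diagonal of the k x k Sigma'
    Vt_k = [list(row) for row in Vt[:k]]   # k x d: first k rows
    return U_k, sigma_k, Vt_k

# Toy factors with w = 3 words and d = 3 documents.
U = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sigma = [4.0, 2.0, 1.0]
Vt = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

U_k, sigma_k, Vt_k = truncate_svd(U, sigma, Vt, 2)
print(len(U_k), "x", len(U_k[0]))     # shape of U'
print(len(Vt_k), "x", len(Vt_k[0]))   # shape of V'
```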

That's the essence of the LSA algorithm.

This resultant matrix U' * Sigma' * V' can be used for 'improved' cosine similarity searching, or you can pick the top 3 words for each document in it, for example. Whether this yields more than a simple tf-idf is a matter of some debate.

To me, LSA performs poorly on real-world data sets because of polysemy, and on data sets with too many topics. Its mathematical / probabilistic basis is unsound (it assumes normal-ish (Gaussian) distributions, which don't make sense for word counts).
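
For the cosine similarity use, here is a sketch that compares documents through the columns of Sigma' * V'. Because the columns of U' are orthonormal, this gives the same cosines as comparing columns of the full product U' * Sigma' * V', with less arithmetic. The function names and the numbers are made up for illustration and do not come from a real SVD.

```python
import math

def document_vectors(sigma_k, Vt_k):
    """Return one reduced vector per document: column j of Sigma' * V'."""
    n_docs = len(Vt_k[0])
    return [[sigma_k[r] * Vt_k[r][j] for r in range(len(sigma_k))]
            for j in range(n_docs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0   # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

sigma_k = [2.0, 1.0]
Vt_k = [[0.6, 0.6, 0.0],
        [0.0, 0.0, 1.0]]

doc_vecs = document_vectors(sigma_k, Vt_k)
print(cosine(doc_vecs[0], doc_vecs[1]))   # documents 0 and 1 point the same way
print(cosine(doc_vecs[0], doc_vecs[2]))   # documents 0 and 2 are orthogonal
```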

Your mileage will definitely vary.

Tagging using LSA (one method!)

1. Construct the dimensionally reduced matrices U', Sigma', V' using SVD and a reduction heuristic.
2. By hand, look over the columns of the U' matrix, and come up with terms that describe each "topic". For example, if the biggest entries of a column belonged to "Bronx, Yankees, Manhattan," then "New York City" might be a good term for it. Keep these in an associative array, or list. This step should be manageable since there are only ii columns to label.
3. Assuming you have a vector (v1) of word weights for a document, as a column of length w, then t(U') * v1 (equivalently v1 * U' with v1 as a row) will give the strength of each 'topic' for that document. Select the 3 highest, then give their "topics" as labeled in the previous step.
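
A sketch of the last tagging step, assuming U' is stored as a list of w rows with one column per topic, and that the hand-written labels sit in a list in the same column order. The function name top_topics, the labels, and the numbers are invented for the example.

```python
def top_topics(v1, U_k, labels, n=3):
    """Score each topic as the dot product of v1 with a column of U', return the n best labels."""
    k = len(U_k[0])
    scores = [sum(v1[i] * U_k[i][t] for i in range(len(v1))) for t in range(k)]
    ranked = sorted(range(k), key=lambda t: scores[t], reverse=True)
    return [labels[t] for t in ranked[:n]]

# Rows follow a 4-word vocabulary: bronx, yankees, oven, recipe.
U_k = [[0.7, 0.0],
       [0.7, 0.1],
       [0.0, 0.7],
       [0.1, 0.7]]
labels = ["New York City", "Cooking"]

v1 = [1.0, 1.0, 0.0, 0.0]   # a document that mentions only bronx and yankees
print(top_topics(v1, U_k, labels, n=1))
```

One thing to watch: the sign of a singular vector is arbitrary, so a topic can come out with all its entries negated. If that happens, ranking by absolute score may be the better choice.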


answered Oct 25 '22 10:10

Gregg Lind


Tags:

php

semantics

tagging

linguistics

lsa

caw
