In Latent Semantic Analysis, how do you recombine the decomposed matrices after truncating the singular values?

I'm reading "Matrix decompositions and latent semantic indexing" (online edition © 2009 Cambridge UP).

I'm trying to understand how you reduce the number of dimensions in a matrix. There's an example on page 13 which I'm trying to replicate using Python's numpy.

Let's call the original occurrence matrix "a" and the three SVD (Singular Value Decomposition) decomposed matrices "U", "S" and "V".

The trouble I'm having is that after I zero out the smaller singular values in "S" and multiply "U", "S" and "V" back together using numpy, the result does not match what is given in the PDF: the bottom 3 rows are not all zeros. The funny thing is that when I just multiply "S" and "V" I do get the right answer.
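
Roughly, here's what I'm doing (a simplified sketch; "a" is the occurrence matrix from the book's example and I keep only the 2 largest singular values):

>>> import numpy as np
>>> a = np.array([[1, 0, 1, 0, 0, 0],
...               [0, 1, 0, 0, 0, 0],
...               [1, 1, 0, 0, 0, 0],
...               [1, 0, 0, 1, 1, 0],
...               [0, 0, 0, 1, 0, 1]])
>>> U, S, V = np.linalg.svd(a, full_matrices=False)  # S comes back as a 1-D array of singular values
>>> S[2:] = 0                                        # zero out the smaller singular values
>>> np.dot(U, np.dot(np.diag(S), V))                 # U*S*V: the bottom rows are NOT all zeros
>>> np.dot(np.diag(S), V)                            # S*V alone: the bottom rows ARE all zeros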

This is sort of surprising, but multiplying "S" and "V" is actually what Manning and Schütze's book Foundations of Statistical Natural Language Processing says you should do. However, this is not what the PDF says to do on page 10.

So what's going on here?

asked Jan 02 '14 by mtanti

1 Answer

Multiplying S and V is exactly what you have to do to perform dimensionality reduction with SVD/LSA.

>>> import numpy as np
>>> C = np.array([[1, 0, 1, 0, 0, 0],
...               [0, 1, 0, 0, 0, 0],
...               [1, 1, 0, 0, 0, 0],
...               [1, 0, 0, 1, 1, 0],
...               [0, 0, 0, 1, 0, 1]])
>>> from scipy.linalg import svd
>>> U, s, VT = svd(C, full_matrices=False)
>>> s[2:] = 0
>>> np.dot(np.diag(s), VT)
array([[ 1.61889806,  0.60487661,  0.44034748,  0.96569316,  0.70302032,
         0.26267284],
       [-0.45671719, -0.84256593, -0.29617436,  0.99731918,  0.35057241,
         0.64674677],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ]])

This gives a matrix where the last few rows are all zeros, so they can be removed; in practice, this smaller matrix is the one you would use in applications:

>>> np.dot(np.diag(s[:2]), VT[:2])
array([[ 1.61889806,  0.60487661,  0.44034748,  0.96569316,  0.70302032,
         0.26267284],
       [-0.45671719, -0.84256593, -0.29617436,  0.99731918,  0.35057241,
         0.64674677]])
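
To give an idea of how that matrix gets used: each column of the 2×6 matrix is a document in the reduced space, so comparing documents becomes comparing columns, e.g. by cosine similarity (a minimal sketch; the name D and the scipy cosine helper are just my choices here):

>>> D = np.dot(np.diag(s[:2]), VT[:2])   # 2 x 6, one column per document
>>> from scipy.spatial.distance import cosine
>>> 1 - cosine(D[:, 0], D[:, 1])         # cosine similarity of documents 1 and 2 in LSA space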

What the PDF describes on page 10 is the recipe to get a low-rank reconstruction of the input C. Rank != dimensionality, and the sheer size and density of the reconstruction matrix make it impractical to use in LSA; its purpose is mostly mathematical. One thing you can do with it is check how good the reconstruction is for various values of k:

>>> U, s, VT = svd(C, full_matrices=False)
>>> C2 = np.dot(U[:, :2], np.dot(np.diag(s[:2]), VT[:2]))
>>> from scipy.spatial.distance import euclidean
>>> euclidean(C2.ravel(), C.ravel())   # Frobenius norm of C2 - C
1.6677932876555255
>>> C3 = np.dot(U[:, :3], np.dot(np.diag(s[:3]), VT[:3]))
>>> euclidean(C3.ravel(), C.ravel())
1.0747879905228703
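
If you want to double-check those numbers without building the reconstruction at all, the Eckart–Young theorem says the Frobenius error of the best rank-k approximation equals the root-sum-of-squares of the discarded singular values, so the same figures can be computed straight from s (a quick sketch using the s obtained above):

>>> np.sqrt(np.sum(s[2:] ** 2))   # should match the rank-2 error above (~1.668)
>>> np.sqrt(np.sum(s[3:] ** 2))   # should match the rank-3 error above (~1.075)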

Sanity check against scikit-learn's TruncatedSVD (full disclosure: I wrote that):

>>> from sklearn.decomposition import TruncatedSVD
>>> TruncatedSVD(n_components=2).fit_transform(C.T)
array([[ 1.61889806, -0.45671719],
       [ 0.60487661, -0.84256593],
       [ 0.44034748, -0.29617436],
       [ 0.96569316,  0.99731918],
       [ 0.70302032,  0.35057241],
       [ 0.26267284,  0.64674677]])

answered Sep 28 '22 by Fred Foo