I'm reading "Matrix decompositions and latent semantic indexing" (online edition © 2009 Cambridge UP).
I'm trying to understand how you reduce the number of dimensions in a matrix. There's an example on page 13 which I'm trying to replicate using Python's numpy.
Let's call the original occurrence matrix "a" and the three SVD (Singular Value Decomposition) decomposed matrices "U", "S" and "V".
The trouble I'm having is that after I zero out the smaller singular values in "S", when I multiply "U", "S" and "V" back together using numpy, the result is not what the PDF gives: the bottom 3 rows are not all zeros. The funny thing is that when I just multiply "S" and "V", I get the right answer.
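For concreteness, here's a minimal sketch of the computation I'm describing (using np.linalg.svd and the book's example matrix; my actual code may differ slightly):

>>> import numpy as np
>>> a = np.array([[1, 0, 1, 0, 0, 0],   # term-document matrix from the example
...               [0, 1, 0, 0, 0, 0],
...               [1, 1, 0, 0, 0, 0],
...               [1, 0, 0, 1, 1, 0],
...               [0, 0, 0, 1, 0, 1]])
>>> U, S, V = np.linalg.svd(a, full_matrices=False)
>>> S[2:] = 0                                # zero out the smaller singular values
>>> usv = np.dot(U, np.dot(np.diag(S), V))   # U*S*V: bottom rows are NOT all zeros
>>> sv = np.dot(np.diag(S), V)               # S*V: bottom rows ARE all zeros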
This is sort of surprising, but multiplying "S" and "V" is actually what Manning and Schütze's book Foundations of Statistical Natural Language Processing says you have to do. Yet this is not what the PDF says to do on page 10.
So what's going on here?
A term-document matrix is a matrix where each element shows how often a word appears in a text. LSA uses a matrix algebra method called Singular Value Decomposition (SVD) to factorize such matrices; SVD is usually impractical to perform by hand for anything more than a small sample of text.
Latent Semantic Analysis is a natural language processing method that analyzes the relationships between a set of documents and the terms they contain. It uses singular value decomposition, a mathematical technique, to find hidden relationships between terms and concepts in unstructured data.
Singular value decomposition (SVD) is a means of decomposing a matrix into a product of three simpler matrices. In this way it is related to other matrix decompositions such as eigendecomposition, principal component analysis (PCA), and non-negative matrix factorization (NMF).
LSA then uses a low-rank approximation to the term-document matrix in order to remove irrelevant information, extract the more important relations, and reduce computational cost. The irrelevant information is called "noise" and does not have a noteworthy effect on the meaning of the document collection.
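As a quick illustration of the factorization itself, here's a minimal sketch with a small example matrix (my own, not the one from the book): the three factors multiply back to the original.

>>> import numpy as np
>>> A = np.array([[3., 1.],
...               [1., 3.],
...               [1., 1.]])
>>> U, s, VT = np.linalg.svd(A, full_matrices=False)   # A == U @ diag(s) @ VT
>>> np.allclose(A, np.dot(U, np.dot(np.diag(s), VT)))
True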
Multiplying S and V is exactly what you have to do to perform dimensionality reduction with SVD/LSA.
>>> import numpy as np
>>> C = np.array([[1, 0, 1, 0, 0, 0],
... [0, 1, 0, 0, 0, 0],
... [1, 1, 0, 0, 0, 0],
... [1, 0, 0, 1, 1, 0],
... [0, 0, 0, 1, 0, 1]])
>>> from scipy.linalg import svd
>>> U, s, VT = svd(C, full_matrices=False)
>>> s[2:] = 0
>>> np.dot(np.diag(s), VT)
array([[ 1.61889806,  0.60487661,  0.44034748,  0.96569316,  0.70302032,
         0.26267284],
       [-0.45671719, -0.84256593, -0.29617436,  0.99731918,  0.35057241,
         0.64674677],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ]])
This gives a matrix where all rows except the first two are zeros, so the zero rows can be dropped; in practice this reduced matrix is the one you would use in applications:
>>> np.dot(np.diag(s[:2]), VT[:2])
array([[ 1.61889806,  0.60487661,  0.44034748,  0.96569316,  0.70302032,
         0.26267284],
       [-0.45671719, -0.84256593, -0.29617436,  0.99731918,  0.35057241,
         0.64674677]])
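For example, here's a sketch of how you could use this reduced matrix to compare documents with cosine similarity (Xk is my name for it; column i is document i in the 2-dimensional latent space):

>>> Xk = np.dot(np.diag(s[:2]), VT[:2])   # 2 x 6: columns are documents
>>> d0, d1 = Xk[:, 0], Xk[:, 1]           # first two documents
>>> cos_01 = np.dot(d0, d1) / (np.linalg.norm(d0) * np.linalg.norm(d1))  # ~0.78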
What the PDF describes on page 10 is the recipe to get a low-rank reconstruction of the input C. Rank != dimensionality, and the sheer size and density of the reconstruction matrix make it impractical to use in LSA; its purpose is mostly mathematical. One thing you can do with it is check how good the reconstruction is for various values of k:
>>> U, s, VT = svd(C, full_matrices=False)
>>> C2 = np.dot(U[:, :2], np.dot(np.diag(s[:2]), VT[:2]))
>>> from scipy.spatial.distance import euclidean
>>> euclidean(C2.ravel(), C.ravel()) # Frobenius norm of C2 - C
1.6677932876555255
>>> C3 = np.dot(U[:, :3], np.dot(np.diag(s[:3]), VT[:3]))
>>> euclidean(C3.ravel(), C.ravel())
1.0747879905228703
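The same quantity is just the Frobenius norm of the difference, which numpy computes directly (np.linalg.norm defaults to the Frobenius norm for 2-D arrays):

>>> np.allclose(np.linalg.norm(C2 - C), euclidean(C2.ravel(), C.ravel()))
True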
Sanity check against scikit-learn's TruncatedSVD (full disclosure: I wrote that):
>>> from sklearn.decomposition import TruncatedSVD
>>> TruncatedSVD(n_components=2).fit_transform(C.T)
array([[ 1.61889806, -0.45671719],
       [ 0.60487661, -0.84256593],
       [ 0.44034748, -0.29617436],
       [ 0.96569316,  0.99731918],
       [ 0.70302032,  0.35057241],
       [ 0.26267284,  0.64674677]])
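Up to the sign indeterminacy inherent in SVD (and small numerical error from the randomized solver TruncatedSVD uses by default), this is the transpose of the reduced matrix computed manually above. A quick check, sketched under those assumptions:

>>> Xt = TruncatedSVD(n_components=2, random_state=0).fit_transform(C.T)
>>> # compare absolute values to sidestep sign flips between the two solvers
>>> np.allclose(np.abs(Xt), np.abs(np.dot(np.diag(s[:2]), VT[:2]).T), atol=1e-6)
True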