
Using SVD to plot word vectors to measure similarity

This is the code I am using to plot word vectors from a co-occurrence matrix of immediate-neighbor counts. I found it on the net; it uses SVD.

 import numpy as np
 import matplotlib.pyplot as plt
 la = np.linalg
 words = ['I', 'like', 'enjoying', 'deep', 'learning', 'NLP', 'flying', '.']
 # Co-occurrence matrix of immediate-neighbor counts: how many times the word
 # before and after a particular word appears (e.g. 'like' appears next to 'I' 2 times)
 arr = np.array([[0,2,1,0,0,0,0,0],
                 [2,0,0,1,0,1,0,0],
                 [1,0,0,0,0,0,1,0],
                 [0,0,0,1,0,0,0,1],
                 [0,1,0,0,0,0,0,1],
                 [0,0,1,0,0,0,0,8],
                 [0,2,1,0,0,0,0,0],
                 [0,0,1,1,1,0,0,0]])
 u, s, v = la.svd(arr, full_matrices=False)
 for i in range(len(words)):
     plt.text(u[i,0], u[i,1], words[i])

In the last line of code, the first element of each row of U is used as the x-coordinate and the second element as the y-coordinate to project the words and see their similarity. What is the intuition behind this approach? Why are they taking the 1st and 2nd elements of each row (each row represents a word) as x and y to represent that word? Please help.
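For context, here is a minimal sketch of how a co-occurrence matrix of immediate-neighbor counts like the one above could be built; the toy corpus and variable names are made up purely for illustration:

```python
import numpy as np

# Hypothetical toy corpus, only to show how neighbor counts are gathered.
corpus = [['I', 'like', 'deep', 'learning', '.'],
          ['I', 'like', 'NLP', '.'],
          ['I', 'enjoying', 'flying', '.']]

words = ['I', 'like', 'enjoying', 'deep', 'learning', 'NLP', 'flying', '.']
index = {w: i for i, w in enumerate(words)}

cooc = np.zeros((len(words), len(words)), dtype=int)
for sentence in corpus:
    for left, right in zip(sentence, sentence[1:]):   # adjacent word pairs
        cooc[index[left], index[right]] += 1          # count the word that follows
        cooc[index[right], index[left]] += 1          # and the word that precedes

print(cooc)   # symmetric matrix of immediate-neighbor counts
```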

Asked Jul 17 '15 by Seja Nair

People also ask

Can SVD on clusters be used to characterize document similarity?

Yes. SVD on clusters, which constructs latent subspaces on document clusters, can characterize document similarity more accurately and appropriately than other SVD-based methods. The variances of the methods mentioned are regarded as comparable because they have similar values.

Do all document vectors have the same cosine similarity?

Theoretically, document vectors that are not in the same cluster submatrix would have zero cosine similarity. In practice, however, all document vectors share the same terms in their representation, and the dimension expansion of the document vectors is obtained by merely copying the original space.

How do I calculate the similarity between two documents and spans?

To calculate the similarity of spans and documents, which don’t have their own word vectors, spaCy averages the word vectors of the tokens they contain. You can calculate the semantic similarity of two container objects even if the two objects are different.
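A minimal sketch of that behavior, assuming a spaCy model that ships with word vectors (e.g. en_core_web_md) is installed; the example texts are made up:

```python
import spacy

nlp = spacy.load("en_core_web_md")   # a model with word vectors

doc1 = nlp("I like deep learning.")
doc2 = nlp("I enjoy flying.")

# Doc and Span vectors are the average of their tokens' vectors,
# so similarity works even between objects of different types.
print(doc1.similarity(doc2))          # Doc vs Doc
print(doc1[2:4].similarity(doc2))     # Span ("deep learning") vs Doc
```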

What is the best way to measure similarity between two features?

You can use any similarity measure that best fits your data. The idea is always the same: two samples with very similar feature vectors (in my case, embeddings) will have a similarity score close to 1; the more the vectors differ, the closer the score will be to zero.
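As a small illustration of that idea, a cosine-similarity sketch with made-up vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = np.array([0.9, 0.1, 0.3])
v2 = np.array([0.8, 0.2, 0.25])   # similar direction -> score close to 1
v3 = np.array([0.0, 1.0, 0.0])    # very different direction -> score close to 0

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```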


1 Answer

import numpy as np
import matplotlib.pyplot as plt
la = np.linalg
words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
# Symmetric co-occurrence matrix of immediate-neighbor counts for `words`
X = np.array([[0,2,1,0,0,0,0,0],
              [2,0,0,1,0,1,0,0],
              [1,0,0,0,0,0,1,0],
              [0,1,0,0,1,0,0,0],
              [0,0,0,1,0,0,0,1],
              [0,1,0,0,0,0,0,1],
              [0,0,1,0,0,0,0,1],
              [0,0,0,0,1,1,1,0]])
U, s, Vh = la.svd(X, full_matrices=False)

# Plot each word at the coordinates given by the first two columns of U,
# i.e. the directions associated with the two largest singular values.
for i in range(len(words)):
    plt.text(U[i, 0], U[i, 1], words[i])
plt.show()

In the plot, pan the axes to the left and you will see all the words (plt.text does not rescale the axes automatically, so the labels can start out of view).
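For the intuition asked about in the question: NumPy returns the singular values in descending order, so the first two columns of U correspond to the two largest singular values and span the best rank-2 approximation of X. Plotting each word at (U[i,0], U[i,1]) is therefore a 2-D embedding that preserves as much of the co-occurrence structure as possible. A minimal sketch of that idea (the rank-2 reconstruction and the cosine check are my own illustration):

```python
import numpy as np

words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
X = np.array([[0,2,1,0,0,0,0,0],
              [2,0,0,1,0,1,0,0],
              [1,0,0,0,0,0,1,0],
              [0,1,0,0,1,0,0,0],
              [0,0,0,1,0,0,0,1],
              [0,1,0,0,0,0,0,1],
              [0,0,1,0,0,0,0,1],
              [0,0,0,0,1,1,1,0]])

U, s, Vh = np.linalg.svd(X, full_matrices=False)

# Keeping only the two largest singular values gives the best rank-2
# approximation of X in the least-squares sense (Eckart-Young theorem).
X2 = U[:, :2] @ np.diag(s[:2]) @ Vh[:2, :]
print(np.linalg.norm(X - X2))   # reconstruction error using only 2 dimensions

# Each word's 2-D coordinates are the rows of U[:, :2] (optionally scaled by s[:2]).
coords = U[:, :2] * s[:2]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(coords[words.index("like")], coords[words.index("enjoy")]))
```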

Answered Sep 25 '22 by Aakash Saxena