I am trying to understand how similarity in Spacy works. I tried using Melania Trump's speech and Michelle Obama's speech to see how similar they were.
This is my code.
import spacy
nlp = spacy.load('en_core_web_lg')
file1 = open("melania.txt").read().decode('ascii', 'ignore')
file2 = open("michelle.txt").read().decode('ascii', 'ignore')
doc1 = nlp(unicode(file1))
doc2 = nlp(unicode(file2))
print doc1.similarity(doc2)
I get the similarity score as 0.9951584208511974. This similarity score looks very high to me. Is this correct? Am I doing something wrong?
Text Similarity In Natural Language Processing (NLP), the answer to “how two words/phrases/documents are similar to each other?” is a crucial topic for research and applications. Text similarity is to calculate how two words/phrases/documents are close to each other. That closeness may be lexical or in meaning.
There are two good ways to calculate the similarity between two words. You can simply use embedding models like word2vec, glove, or fasttext (my recommendation), which all are famous and useful. The main objective of embedding models is to map a word to a vector.
While NLTK provides access to many algorithms to get something done, spaCy provides the best way to do it. It provides the fastest and most accurate syntactic analysis of any NLP library released to date. It also offers access to larger word vectors that are easier to customize.
Sentence similarity or semantic textual similarity is a measure of how similar two pieces of text are, or to what degree they express the same meaning. Related tasks include paraphrase or duplicate identification, search, and matching applications.
By default spaCy calculates cosine similarity. Similarity is determined by comparing word vectors or word embeddings, multi-dimensional meaning representations of a word.
It returns return (numpy.dot(self.vector, other.vector) / (self_norm * other_norm))
text1 = 'How can I end violence?'
text2 = 'What should I do to be a peaceful?'
doc1 = nlp(text1)
doc2 = nlp(text2)
print("spaCy :", doc1.similarity(doc2))
print(np.dot(doc1.vector, doc2.vector) / (np.linalg.norm(doc1.vector) * np.linalg.norm(doc2.vector)))
Output:
spaCy : 0.916553147896471
0.9165532
It seems that spaCy's .vector
method created the vectors. Documentation says that spaCy's models are trained from GloVe's vectors.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With