
Similarity in spaCy

I am trying to understand how similarity in spaCy works. I compared Melania Trump's speech with Michelle Obama's speech to see how similar they are.

This is my code.

import spacy

nlp = spacy.load('en_core_web_lg')

# Read both speeches, silently dropping any non-ASCII characters
with open("melania.txt", encoding="ascii", errors="ignore") as f:
    file1 = f.read()
with open("michelle.txt", encoding="ascii", errors="ignore") as f:
    file2 = f.read()

doc1 = nlp(file1)
doc2 = nlp(file2)
print(doc1.similarity(doc2))

I get a similarity score of 0.9951584208511974, which looks very high to me. Is this correct? Am I doing something wrong?

asked Nov 23 '18 22:11 by thehydrogen




1 Answer

By default, spaCy computes cosine similarity. It compares word vectors (word embeddings), which are multi-dimensional representations of a word's meaning.
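As a minimal sketch of what cosine similarity measures, the plain-Python snippet below uses toy 3-d vectors invented for illustration (they are not spaCy output). It suggests that the score depends only on the angle between two vectors, not on their length:

```python
import math

def cosine_similarity(u, v):
    """Return dot(u, v) / (|u| * |v|) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, chosen only to illustrate the behaviour
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # parallel
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])       # perpendicular
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])      # reversed

print(same_direction, orthogonal, opposite)
```

Parallel vectors score close to 1, perpendicular ones score 0, and opposite ones score close to -1, so a value near 1 only says that two vectors point in nearly the same direction.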

Internally, the similarity method returns:

numpy.dot(self.vector, other.vector) / (self_norm * other_norm)

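import numpy as np
import spacy

nlp = spacy.load('en_core_web_lg')  # assumed: same model as in the question
text1 = 'How can I end violence?'
text2 = 'What should I do to be a peaceful?'
doc1 = nlp(text1)
doc2 = nlp(text2)
# Similarity as reported by spaCy
print("spaCy :", doc1.similarity(doc2))

# The same cosine similarity computed by hand from the document vectors
print(np.dot(doc1.vector, doc2.vector) / (np.linalg.norm(doc1.vector) * np.linalg.norm(doc2.vector)))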

Output:

spaCy : 0.916553147896471
0.9165532

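The two numbers match, so the score is the cosine of the two documents' .vector attributes. For a Doc, .vector defaults to the average of its token vectors. According to the documentation, the word vectors shipped with spaCy's models are GloVe vectors.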
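One plausible contributor to a score as high as the one in the question: assuming a document vector is the plain average of its token vectors (spaCy's documented default for Doc.vector), frequent words shared by both texts can dominate that average. The sketch below uses a hypothetical 3-d vocabulary invented for illustration. It is not spaCy's implementation or its real vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def average(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Hypothetical 3-d "word vectors", made up for this example
toy_vectors = {
    "the": [1.0, 0.0, 0.0],
    "and": [0.9, 0.1, 0.0],
    "dream": [0.0, 1.0, 0.0],
    "taxes": [0.0, 0.0, 1.0],
}

# Two toy "documents": same function words, different content word
doc_a = ["the", "and", "the", "and", "dream"]
doc_b = ["the", "and", "the", "and", "taxes"]

vec_a = average([toy_vectors[w] for w in doc_a])
vec_b = average([toy_vectors[w] for w in doc_b])

content_similarity = cosine(toy_vectors["dream"], toy_vectors["taxes"])
document_similarity = cosine(vec_a, vec_b)
print("content words:", content_similarity)
print("whole documents:", document_similarity)
```

In this toy setup the two content words are unrelated, yet the averaged document vectors still score above 0.9 because the shared words outweigh them. Long real texts share many common words, so a high document-level score does not by itself mean the texts say the same thing.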

answered Sep 23 '22 17:09 by Srce Cde