In the spaCy tutorial example in Python, apples.similarity(oranges) returns
0.39289959293092641
instead of 0.7857989796519943.
Is there a reason for that? The original tutorial docs: https://spacy.io/docs/ A tutorial that shows a different result from the one I get: http://textminingonline.com/getting-started-with-spacy
Thanks
That appears to be a bug in spaCy.
Somehow vector_norm is calculated incorrectly.
import spacy
import numpy as np
nlp = spacy.load("en")
# using u"apples" just as an example
apples = nlp.vocab[u"apples"]
# norm as cached by spaCy
print(apples.vector_norm)
# prints 1.4142135381698608, or sqrt(2)
# norm recomputed directly from the vector
print(np.sqrt(np.dot(apples.vector, apples.vector)))
# prints 1.0
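similarity then divides by the product of the two vector_norm values. With each norm too large by a factor of sqrt(2), the denominator is twice what it should be, so the result is always half of the correct value. That matches the question, where 0.3929 is roughly half of 0.7858.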
def similarity(self, other):
    if self.vector_norm == 0 or other.vector_norm == 0:
        return 0.0
    return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)
If you only rank similarity scores, for example to find synonyms, this might be OK, because a constant factor does not change the ordering. But if you need the actual cosine similarity score, the result is incorrect.
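Until that is fixed, one workaround is to compute the cosine similarity from the raw vectors yourself instead of relying on the cached norm. Here is a minimal sketch in plain Python. The helper name cosine_similarity and the toy vectors u and v are my own, used for illustration only and not part of spaCy. The "buggy" value just reproduces the symptom described above by inflating each norm by sqrt(2).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity computed from the raw vectors, with no cached norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)

# Toy unit-length vectors standing in for word vectors (not real spaCy data).
u = [1.0, 0.0]
v = [0.6, 0.8]

correct = cosine_similarity(u, v)

# Reproduce the reported symptom: each norm is too large by a factor of sqrt(2),
# so the denominator doubles and the score is halved.
inflated = math.sqrt(2)
dot_uv = sum(x * y for x, y in zip(u, v))
buggy = dot_uv / ((1.0 * inflated) * (1.0 * inflated))

print(correct)  # about 0.6
print(buggy)    # about 0.3, half of the correct value
```

The same function should work on apples.vector and oranges.vector, since it only needs two equal-length sequences of numbers.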
I submitted the issue here. Hopefully it will get fixed soon.