With Gensim, after I've trained my own model, I can call `model.wv.most_similar('cat', topn=5)` and get a list of the 5 words closest to "cat" in the vector space. For example:

```python
from gensim.models import Word2Vec

model = Word2Vec.load('mymodel.model')
model.wv.most_similar('cat', topn=5)
# [('kitten', 0.99), ('dog', 0.98), ...]
```
With spaCy, as per the documentation, I can do:

```python
import spacy

nlp = spacy.load('en_core_web_md')
tokens = nlp(u'dog cat banana')
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))
```

which gives pairwise similarities for the tokens in a given string. But combing through the docs and searching around, I can't figure out whether there is a Gensim-style way of listing all similar words for a preloaded model, with either `nlp = spacy.load('en_core_web_lg')` or `nlp = spacy.load('en_vectors_web_lg')`. Is there a way to do this?
It's not implemented out of the box. However, based on this issue (https://github.com/explosion/spaCy/issues/276), here is some code that does what you want:

```python
import spacy
import numpy as np

nlp = spacy.load('en_core_web_lg')

def most_similar(word, topn=5):
    word = nlp.vocab[str(word)]
    # Restrict to words with the same casing, a reasonable frequency,
    # and a non-zero vector.
    queries = [
        w for w in word.vocab
        if w.is_lower == word.is_lower and w.prob >= -15 and np.count_nonzero(w.vector)
    ]
    by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)
    return [(w.lower_, w.similarity(word)) for w in by_similarity[:topn + 1] if w.lower_ != word.lower_]

most_similar("dog", topn=3)
```
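As a side note, the per-token `word.similarity(w)` calls in the loop above dominate the runtime; the same ranking can be computed in one vectorized cosine-similarity pass over a matrix of vectors. A minimal NumPy sketch of that idea, using a toy vector table (the words and vectors here are made up for illustration, standing in for the model's vocabulary):

```python
import numpy as np

# Toy stand-in for the vocabulary: 3 words with 3-d vectors (made-up data).
words = ["dog", "cat", "banana"]
vectors = np.array([
    [1.0, 0.0, 1.0],
    [0.9, 0.1, 1.0],
    [0.0, 1.0, 0.0],
])

def most_similar_vectorized(query_vec, vectors, words, topn=2):
    # One matrix product gives all cosine similarities at once.
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query_vec)
    sims = vectors @ query_vec / norms
    best = np.argsort(-sims)[:topn]
    return [(words[i], float(sims[i])) for i in best]

most_similar_vectorized(vectors[0], vectors, words, topn=2)
# [('dog', 1.0), ('cat', 0.99...)]
```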
I used Andy's answer and it worked correctly, but slowly. To speed it up, I took the approach below.

spaCy uses cosine similarity under the hood to compute `.similarity`, so I replaced `word.similarity(w)` with an optimized counterpart, `cosine_similarity_numba(w.vector, word.vector)` (shown below), which uses the Numba library to speed up the computation. Replace the `by_similarity = ...` line in `most_similar` with:

```python
by_similarity = sorted(queries, key=lambda w: cosine_similarity_numba(w.vector, word.vector), reverse=True)
```

This made the method 2-3 times faster, which was essential for me.
```python
import numpy as np
from numba import jit

@jit(nopython=True)
def cosine_similarity_numba(u: np.ndarray, v: np.ndarray):
    assert u.shape[0] == v.shape[0]
    # Accumulate the dot product and squared norms in one pass.
    uv = 0
    uu = 0
    vv = 0
    for i in range(u.shape[0]):
        uv += u[i] * v[i]
        uu += u[i] * u[i]
        vv += v[i] * v[i]
    cos_theta = 1
    if uu != 0 and vv != 0:
        cos_theta = uv / np.sqrt(uu * vv)
    return cos_theta
```
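For reference, the loop above is just the standard cosine formula cos θ = u·v / (‖u‖ ‖v‖). A plain-NumPy version (no Numba, so it runs anywhere) gives the same result, including the kernel's convention of returning 1 when either vector is zero; this is a sketch, not part of the original answer, that could serve as a fallback when Numba isn't installed:

```python
import numpy as np

def cosine_similarity_np(u, v):
    # Same formula as the Numba kernel: u.v / (||u|| * ||v||),
    # returning 1 for a zero vector, as the kernel does.
    uu, vv = np.dot(u, u), np.dot(v, v)
    if uu == 0 or vv == 0:
        return 1.0
    return float(np.dot(u, v) / np.sqrt(uu * vv))

cosine_similarity_np(np.array([1.0, 0.0]), np.array([1.0, 1.0]))
# ~0.7071 (cos 45 degrees)
```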
I explained it in more detail in this article: How to Build a Fast "Most-Similar Words" Method in SpaCy.