
List most similar words in spaCy in pretrained model

Tags:

python

spacy

With Gensim, after I've trained my own model, I can use model.wv.most_similar('cat', topn=5) and get a list of the 5 words that are closest to cat in the vector space. For example:

from gensim.models import Word2Vec
model = Word2Vec.load('mymodel.model')

In:  model.wv.most_similar('cat', topn=5)
Out: [('kitten', 0.99),
      ('dog', 0.98),
      ...]

With spaCy, as per the documentation, I can do:

import spacy

nlp = spacy.load('en_core_web_md')
tokens = nlp(u'dog cat banana')

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

which gives similarity for tokens in a specified string. But combing through the docs and searching, I can't figure out if there is a gensim-type way of listing all similar words for a preloaded model with either nlp = spacy.load('en_core_web_lg') or nlp = spacy.load('en_vectors_web_lg'). Is there a way to do this?

asked Aug 28 '19 by snapcrack

2 Answers

It's not implemented out of the box. However, based on this issue (https://github.com/explosion/spaCy/issues/276), here is some code that does what you want.

import spacy
import numpy as np

nlp = spacy.load('en_core_web_lg')

def most_similar(word, topn=5):
    word = nlp.vocab[str(word)]
    # Keep only lexemes with the same casing as the query that are
    # reasonably frequent (prob >= -15) and actually have a vector.
    queries = [
        w for w in word.vocab
        if w.is_lower == word.is_lower and w.prob >= -15 and np.count_nonzero(w.vector)
    ]
    by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)
    # Take topn+1 candidates, then drop the query word itself.
    return [(w.lower_, w.similarity(word)) for w in by_similarity[:topn + 1]
            if w.lower_ != word.lower_]

most_similar("dog", topn=3)
answered Sep 18 '22 by Romain


I used the approach from the answer above and it worked correctly, but slowly. To resolve that, I took the approach below.

SpaCy uses cosine similarity in the backend to compute .similarity. Therefore, I decided to replace word.similarity(w) with an optimized counterpart: cosine_similarity_numba(w.vector, word.vector), shown below, which uses the Numba library to speed up the computation. Replace the sorted(...) line in the most_similar method with the line below.

by_similarity = sorted(queries, key=lambda w: cosine_similarity_numba(w.vector, word.vector), reverse=True)

The method became 2-3 times faster which was essential for me.

import numpy as np
from numba import jit

@jit(nopython=True)
def cosine_similarity_numba(u: np.ndarray, v: np.ndarray):
    assert u.shape[0] == v.shape[0]
    # Accumulate the dot product and both squared norms in a single pass.
    uv = 0.0
    uu = 0.0
    vv = 0.0
    for i in range(u.shape[0]):
        uv += u[i] * v[i]
        uu += u[i] * u[i]
        vv += v[i] * v[i]
    cos_theta = 1.0
    if uu != 0 and vv != 0:
        cos_theta = uv / np.sqrt(uu * vv)
    return cos_theta
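If you'd rather avoid the Numba dependency, the same ranking can be vectorized with plain NumPy by stacking all candidate vectors into one matrix and computing every cosine similarity in a single matrix-vector product. This is a sketch of the idea: `most_similar_from_matrix` and the toy vectors are my own illustration, not part of spaCy's API.

```python
import numpy as np

def most_similar_from_matrix(matrix, keys, query_vec, topn=5):
    """Rank rows of `matrix` by cosine similarity to `query_vec`.

    matrix : (n_words, dim) array of word vectors
    keys   : list of n_words labels matching the rows
    """
    # Normalize the rows and the query once, then a single
    # matrix-vector product yields every cosine similarity.
    norms = np.linalg.norm(matrix, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero vectors
    sims = (matrix / norms[:, None]) @ (query_vec / np.linalg.norm(query_vec))
    order = np.argsort(-sims)[:topn]
    return [(keys[i], float(sims[i])) for i in order]

# Toy example with made-up 3-d "word vectors".
keys = ["cat", "kitten", "dog", "banana"]
matrix = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [0.7, 0.5, 0.2],
    [0.0, 0.1, 1.0],
])
print(most_similar_from_matrix(matrix, keys, matrix[0], topn=2))
```

In practice you would build `matrix` from the vectors of the filtered lexemes and `keys` from their texts; the `O(n·dim)` matrix product then replaces `n` separate `.similarity` calls.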

I explained it in more detail in this article: How to Build a Fast “Most-Similar Words” Method in SpaCy

answered Sep 20 '22 by Pedram