 

Mapping word vector to the most similar/closest word using spaCy

I am using spaCy as part of a topic modelling solution and I have a situation where I need to map a derived word vector to the "closest" or "most similar" word in a vocabulary of word vectors.

I see gensim has a function (WordEmbeddingsKeyedVectors.similar_by_vector) to calculate this, but I was wondering if spaCy has something like this to map a vector to a word within its vocabulary (nlp.vocab)?

asked Feb 15 '19 by Eric Broda

2 Answers

After a bit of experimentation, I found a SciPy function (cdist in scipy.spatial.distance) that finds the "closest" vector in a vector space to an input vector.

# Imports
import numpy as np
from scipy.spatial import distance
import spacy

# Load the spacy vocabulary
nlp = spacy.load("en_core_web_lg")

# Format the input vector for use in the distance function
# In this case we will artificially create a word vector from a real word ("frog")
# but any derived word vector could be used
input_word = "frog"
p = np.array([nlp.vocab[input_word].vector])

# Format the vocabulary for use in the distance function
ids = [x for x in nlp.vocab.vectors.keys()]
vectors = [nlp.vocab.vectors[x] for x in ids]
vectors = np.array(vectors)

# *** Find the closest word below ***
# Note: cdist defaults to Euclidean distance; pass metric="cosine" to compare by direction
closest_index = distance.cdist(p, vectors).argmin()
word_id = ids[closest_index]
output_word = nlp.vocab[word_id].text
# output_word is identical, or very close, to the input word
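Since cdist defaults to Euclidean distance, and cosine distance is often the better fit for word embeddings, it may be worth passing `metric="cosine"`. A minimal, self-contained sketch using toy 2-D vectors (the vocabulary below is made up for illustration, not real embeddings):

```python
import numpy as np
from scipy.spatial import distance

# Toy "vocabulary" of 2-D vectors (illustrative only, not real embeddings)
vocab = {"frog": [1.0, 0.2], "toad": [0.9, 0.3], "car": [-0.5, 1.0]}
words = list(vocab.keys())
vectors = np.array(list(vocab.values()))

# A derived query vector, pointing in roughly the same direction as "frog"
query = np.array([[2.0, 0.5]])

# Cosine distance compares direction and ignores vector magnitude,
# so the scaled-up query still maps back to "frog"
closest_index = distance.cdist(query, vectors, metric="cosine").argmin()
print(words[closest_index])  # frog
```

The same pattern drops straight into the spaCy code above: replace the toy `vectors` array with the vocabulary vectors and `query` with the derived word vector.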
answered Oct 24 '22 by Eric Broda

Yes, spaCy has an API method for this, analogous to gensim's KeyedVectors.similar_by_vector:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")

your_word = "king"

ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]), n=10)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]
print(words)
# ['King', 'KIng', 'king', 'KING', 'kings', 'KINGS', 'Kings', 'PRINCE', 'Prince', 'prince']

(The words are not properly normalized in en_core_web_lg, but you could try other models and observe a more representative output.)
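Since en_core_web_lg keys many case variants of the same word, one simple post-processing step is to deduplicate the results on their lowercased form while preserving rank order. A small sketch, using the sample output above as input:

```python
# Sample output from most_similar above
words = ['King', 'KIng', 'king', 'KING', 'kings', 'KINGS', 'Kings',
         'PRINCE', 'Prince', 'prince']

# Keep only the first (highest-ranked) occurrence of each
# case-insensitive form, preserving the original order
seen = set()
deduped = []
for w in words:
    key = w.lower()
    if key not in seen:
        seen.add(key)
        deduped.append(w)

print(deduped)  # ['King', 'kings', 'PRINCE']
```

To end up with n distinct words after deduplication, request a larger n from most_similar and truncate afterwards.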

answered Oct 24 '22 by Amir