Get most similar words, given the vector of the word (not the word itself)

Using the gensim.models.Word2Vec library, you can provide a model and a "word" for which you want to find the list of most similar words:

model = gensim.models.Word2Vec.load_word2vec_format(model_file, binary=True)
model.most_similar(positive=[WORD], topn=N)

I wonder whether it is possible to give the system a model and a "vector" as input, and ask it to return the top similar words (those whose vectors are closest to the given vector). Something like:

model.most_similar(positive=[VECTOR], topn=N) 

I need this functionality for a bilingual setting, in which I have two models (English and German), and a set of English words for which I need to find the most similar German candidates. What I want to do is get the vector of each English word from the English model:

model_EN = gensim.models.Word2Vec.load_word2vec_format(model_file_EN, binary=True)
vector_w_en = model_EN[WORD_EN]

and then query the German model with these vectors.

model_DE = gensim.models.Word2Vec.load_word2vec_format(model_file_DE, binary=True)
model_DE.most_similar(positive=[vector_w_en], topn=N)
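To make the intended query concrete, here is a minimal NumPy sketch of the same idea, with toy made-up vectors standing in for the real English and German models (all words, dimensions, and values below are illustrative only):

```python
import numpy as np

# Toy "German" vocabulary and embedding matrix (one row per word).
de_vocab = ["hund", "katze", "haus"]
de_vecs = np.array([[1.0, 0.0],
                    [0.8, 0.6],
                    [0.0, 1.0]])
# Unit-normalize the rows so a dot product equals cosine similarity.
de_vecs = de_vecs / np.linalg.norm(de_vecs, axis=1, keepdims=True)

# Pretend this is the vector of an English word taken from the English model.
vec_en = np.array([0.9, 0.1])
vec_en = vec_en / np.linalg.norm(vec_en)

# Cosine similarity of the English vector against every German word,
# then take the top-N indices in descending order of similarity.
sims = de_vecs @ vec_en
topn = 2
best = np.argsort(-sims)[:topn]
top_words = [(de_vocab[i], float(sims[i])) for i in best]
print(top_words)
```

This is exactly the ranking the C distance tool computes; the question is whether gensim exposes it directly for an arbitrary vector.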

I have implemented this in C using the original distance function from the word2vec package, but now I need it in Python so I can integrate it with my other scripts.

Do you know if there is already a method in the gensim.models.Word2Vec library, or another similar library, that does this? Or do I need to implement it myself?

asked Jun 14 '16 by amin

People also ask

What is a vector of words?

A word vector is an attempt to mathematically represent the meaning of a word. In essence, a computer goes through some text (ideally a lot of it) and calculates how often words show up next to each other; these frequencies are then represented as numbers.

How do you convert words into vectors?

Converting words to vectors, or word vectorization, is a natural language processing (NLP) technique. It uses language models to map words into a vector space, in which each word is represented by a vector of real numbers. This lets words with similar meanings have similar representations.

What are word Embeddings why are they useful do you know word2vec?

Word embeddings are one of the most popular representations of document vocabulary. They are capable of capturing the context of a word in a document, semantic and syntactic similarity, relations with other words, and so on.


1 Answer

The method similar_by_vector returns the top-N most similar words by vector:

similar_by_vector(vector, topn=10, restrict_vocab=None) 
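Under the hood this is a cosine-similarity ranking of the whole vocabulary against the query vector. A rough NumPy equivalent, useful if you want to see what the method computes (the function name and arguments below mirror gensim's signature but this is an illustrative sketch, not gensim's actual internals):

```python
import numpy as np

def similar_by_vector(vectors, vocab, vector, topn=10, restrict_vocab=None):
    """Rank `vocab` by cosine similarity to an arbitrary query `vector`.

    `vectors` is a (num_words, dim) embedding matrix and `vocab` the
    matching word list. If `restrict_vocab` is given, only the first
    that many words (typically the most frequent) are considered.
    """
    if restrict_vocab is not None:
        vectors = vectors[:restrict_vocab]
        vocab = vocab[:restrict_vocab]
    # Normalize rows and the query so the dot product is cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = vector / np.linalg.norm(vector)
    sims = unit @ q
    best = np.argsort(-sims)[:topn]
    return [(vocab[i], float(sims[i])) for i in best]

# Toy data: three made-up 2-d "word vectors".
vocab = ["apple", "orange", "car"]
vecs = np.array([[1.0, 0.1],
                 [0.9, 0.2],
                 [0.0, 1.0]])
result = similar_by_vector(vecs, vocab, np.array([1.0, 0.0]), topn=2)
print(result)
```

For the bilingual use case in the question, you would pass the English word's vector (`model_EN[WORD_EN]`) straight to the German model's `similar_by_vector`.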
answered Oct 01 '22 by user48135