Using the gensim.models.Word2Vec library, you can load a model and ask for the list of words most similar to a given "word":
model = gensim.models.Word2Vec.load_word2vec_format(model_file, binary=True)
model.most_similar(positive=[WORD], topn=N)
I wonder if it is possible to give the system a model and a "vector" as input, and have it return the top similar words (those whose vectors are closest to the given vector). Something like:
model.most_similar(positive=[VECTOR], topn=N)
I need this functionality for a bilingual setting, in which I have two models (English and German) and some English words for which I need to find the most similar German candidates. What I want to do is get the vector of each English word from the English model:
model_EN = gensim.models.Word2Vec.load_word2vec_format(model_file_EN, binary=True)
vector_w_en = model_EN[WORD_EN]
and then query the German model with these vectors:
model_DE = gensim.models.Word2Vec.load_word2vec_format(model_file_DE, binary=True)
model_DE.most_similar(positive=[vector_w_en], topn=N)
I have implemented this in C using the original distance function from the word2vec package, but now I need it in Python so that I can integrate it with my other scripts.
Do you know whether there is already a method in the gensim.models.Word2Vec library (or a similar library) that does this, or do I need to implement it myself?
WHAT IS A WORD VECTOR? A word vector is an attempt to represent the meaning of a word mathematically. In essence, a computer goes through some text (ideally a lot of text) and counts how often words appear next to each other; these co-occurrence frequencies are encoded as numbers.
Converting words to vectors, or word vectorization, is a natural language processing (NLP) technique that uses language models to map words into a vector space, where each word is represented by a vector of real numbers and words with similar meanings have similar representations.
Word embeddings are among the most popular representations of a document's vocabulary. They can capture the context of a word in a document, semantic and syntactic similarity, and relations with other words.
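Similarity between two word vectors is typically measured by cosine similarity: the cosine of the angle between them, which is 1.0 for vectors pointing in the same direction and 0.0 for orthogonal ones. A minimal pure-Python sketch (the toy vectors are made up for illustration):

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors score ~1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```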
The method similar_by_vector returns the top-N most similar words for a given vector:
similar_by_vector(vector, topn=10, restrict_vocab=None)
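For the bilingual case above, that corresponds to calling model_DE.similar_by_vector(vector_w_en, topn=N). Under the hood, this amounts to ranking every word in the vocabulary by cosine similarity to the query vector. A minimal pure-Python sketch of that logic, using a toy German "vocabulary" with made-up 2-d vectors (purely for illustration, not real embeddings):

```python
import math

def most_similar_by_vector(vocab, query, topn=10):
    """Rank vocabulary words by cosine similarity to a query vector.

    vocab: dict mapping word -> vector (list of floats).
    Mirrors the behaviour of gensim's similar_by_vector.
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

    scored = sorted(((word, cos(vec, query)) for word, vec in vocab.items()),
                    key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical German model: word -> 2-d vector.
vocab_de = {
    "hund": [0.9, 0.1],
    "katze": [0.8, 0.3],
    "haus": [0.1, 0.9],
}
query = [1.0, 0.0]  # e.g. the English vector for "dog"
print(most_similar_by_vector(vocab_de, query, topn=2))
# "hund" ranks first, "katze" second
```

In practice you would not reimplement this yourself: load the German model with gensim and pass the English vector directly to model_DE.similar_by_vector(vector_w_en, topn=N). Note that this only yields meaningful cross-lingual neighbours if the two embedding spaces are aligned (e.g. trained jointly or mapped into a shared space); independently trained monolingual models live in unrelated coordinate systems.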