
How to infer new word vectors from a gensim word2vec model?

I want to add new words to a trained gensim word2vec model using a new text dataset, while preserving the existing word embeddings. Simply continuing to train the old model on the new dataset isn't an option, because it would also readjust the vectors of already-known words that appear in the new text. Can you give any suggestions for this task? I would like something like gensim's doc2vec inference feature (infer_vector), where you feed the model some text and it returns a vector. Thanks.

Wargream asked Nov 08 '22 09:11



1 Answer

I would do the following (pseudoPython):

import numpy as np

# Assumes w2v is a trained gensim Word2Vec model and thesaurus is a
# placeholder for whatever synonym source you have available.
new_embeddings = {}
for word in new_words:
    # find words that should be nearby
    synonyms = thesaurus.lookup(word)

    # collect the embeddings of the synonyms the model already knows
    known = [w2v.wv[syn] for syn in synonyms if syn in w2v.wv]

    # average them; fall back to a zero vector if no synonym is known
    if known:
        new_embeddings[word] = np.mean(known, axis=0)
    else:
        new_embeddings[word] = np.zeros(w2v.vector_size)
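
To make the idea concrete, here is a minimal self-contained sketch of the same averaging approach using only NumPy. The plain dict stands in for the model's vocabulary, and the toy vectors, the synonym list, and the helper name average_known_vectors are all made up for illustration; none of them come from gensim.

```python
import numpy as np


def average_known_vectors(neighbors, embeddings):
    """Average the vectors of the neighbors that exist in embeddings.

    Returns None when none of the neighbors has a vector.
    """
    known = [embeddings[n] for n in neighbors if n in embeddings]
    if not known:
        return None
    return np.mean(known, axis=0)


# Toy 3-dimensional "pretrained" vectors (illustrative values only).
embeddings = {
    "car": np.array([1.0, 0.0, 0.0]),
    "truck": np.array([0.0, 1.0, 0.0]),
    "apple": np.array([0.0, 0.0, 1.0]),
}
# Keep a copy so we can check that the old vectors were not modified.
original = {w: v.copy() for w, v in embeddings.items()}

# Hypothetical synonym list standing in for a thesaurus lookup.
# "lorry" has no vector, so it is skipped when averaging.
synonyms = {"automobile": ["car", "truck", "lorry"]}

for word, neighbors in synonyms.items():
    if word in embeddings:
        continue  # never overwrite an existing vector
    vec = average_known_vectors(neighbors, embeddings)
    if vec is not None:
        embeddings[word] = vec

print(embeddings["automobile"])
```

Because only new keys are ever written, the existing vectors stay exactly as they were. With a real model, the computed vectors could probably be appended through gensim 4's KeyedVectors.add_vectors (for example on w2v.wv), but check the documentation for your gensim version before relying on that.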
Sam H. answered Nov 15 '22 11:11
