I want to add new words to a trained gensim word2vec model using a new text dataset. However, I want to preserve the old word embeddings and only add the new words from the dataset to the existing model. This means simply retraining the old model on the new text dataset isn't an option, as it would readjust the vectors of the existing words that also occur in the new dataset. Can you give any suggestions for this task? I would like something like gensim's doc2vec infer feature, where you feed the model some text input and it gives a vector as output. Thanks.
I would do the following (pseudo-Python). Note the original averaging was buggy: `np.array(a, b)` treats its second argument as a dtype, and seeding the average with a zero vector dilutes the result. Instead, collect the synonym embeddings first and take their mean:

for word in new_words:
    # find words that should be nearby
    synonyms = thesaurus.lookup(word)
    # collect the embeddings of synonyms that already exist in the model
    syn_vectors = []
    for syn in synonyms:
        if w2v.get_embedding(syn) is not None:
            syn_vectors.append(w2v.get_embedding(syn))
    # average the synonym embeddings to get the new word's vector
    if syn_vectors:
        new_word_embedding = np.mean(syn_vectors, axis=0)