I have a word2vec model in gensim trained on 98892 documents. For any given sentence that is not present in the sentences array (i.e. the set on which I trained the model), I need to update the model with that sentence so that querying it the next time returns some results. I am doing it like this:
new_sentence = ['moscow', 'weather', 'cold']
model.train(new_sentence)
and it prints this in the logs:
2014-03-01 16:46:58,061 : INFO : training model with 1 workers on 98892 vocabulary and 100 features
2014-03-01 16:46:58,211 : INFO : reached the end of input; waiting to finish 1 outstanding jobs
2014-03-01 16:46:58,235 : INFO : training on 10 words took 0.1s, 174 words/s
Now, when I query the model with the same new_sentence (as model.most_similar(positive=new_sentence)), it gives this error:
Traceback (most recent call last):
  File "<pyshell#220>", line 1, in <module>
    model.most_similar(positive=['moscow', 'weather', 'cold'])
  File "/Library/Python/2.7/site-packages/gensim/models/word2vec.py", line 405, in most_similar
    raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word 'cold' not in vocabulary"
This indicates that the word 'cold' is not part of the vocabulary on which I trained the model (am I right?).
So the question is: How to update the model so that it gives out all the possible similarities for the given new sentence?
The full model can be stored/loaded via its save() and load() methods. The trained word vectors can also be stored/loaded from a format compatible with the original word2vec implementation via self.
Gensim is a topic modelling library for Python that provides access to Word2Vec and other word embedding algorithms for training, and it also allows pre-trained word embeddings that you can download from the internet to be loaded.
Gensim provides the Word2Vec class for working with a Word2Vec model. Learning a word embedding from text involves loading and organizing the text into sentences and providing them to the constructor of a new Word2Vec() instance. For example:
sentences = ...
model = Word2Vec(sentences)
train() expects a sequence of sentences on input, not one sentence.
train() only updates weights for existing feature vectors based on the existing vocabulary. You cannot add new vocabulary (= new feature vectors) using train().
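To see why a bare sentence is a problem, here is a plain-Python illustration (not gensim's own code). The helper tokens_seen is hypothetical; it only mimics how a corpus iterator treats each item as a sentence and each element of a sentence as a token.

```python
# One tokenized sentence: a list of word strings.
new_sentence = ['moscow', 'weather', 'cold']

def tokens_seen(corpus):
    """Flatten a corpus: each item is one sentence, each element of it one token."""
    return [token for sentence in corpus for token in sentence]

# Passed bare, each word is taken for a sentence and each character for a token.
wrong = tokens_seen(new_sentence)
# Wrapped in a list, the corpus holds one sentence made of three word tokens.
right = tokens_seen([new_sentence])

print(wrong[:6])  # ['m', 'o', 's', 'c', 'o', 'w']
print(right)      # ['moscow', 'weather', 'cold']
```

So the call in the question should at least pass [new_sentence] rather than new_sentence, although that alone does not add 'cold' to the vocabulary.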
As of gensim 0.13.3 it's possible to do online training of Word2Vec with gensim.
model.build_vocab(new_sentences, update=True)
model.train(new_sentences)