I have tried to incrementally train a word2vec model produced by gensim, but I found that the vocabulary size does not increase; only the model weights are updated. I need to update both the vocabulary and the model size.
from gensim.models import Word2Vec

# Load data
sentences = []
....................
# Training
model = Word2Vec(sentences, size=100)
model.save("modelbygensim.txt")
model.save_word2vec_format("modelbygensim_text.txt")
# Incremental training
model = Word2Vec.load('modelbygensim.txt')
model.train(sentences)
model.save("modelbygensim_incremental.txt")
model.save_word2vec_format("modelbygensim_text_incremental.txt")
By default, gensim Word2Vec only does vocabulary discovery once. It happens when you supply a corpus, like your sentences, to the initial constructor (which does an automatic vocabulary scan and train), or alternatively when you call build_vocab(). While you can continue to call train(), no new words will be recognized.
There is support (that I would consider experimental) for calling build_vocab() with new text examples, and an update=True parameter, to expand the vocabulary. While this would let further train() calls train both old and new words, there are many caveats:
train() should use one of the optional parameters to give an accurate estimate of the new batch size (in words or examples) so that learning-rate decay and progress logging are done properly.
If at all possible, combine all your examples into one corpus, and do one large vocabulary discovery followed by training.
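As a rough illustration of the update path described above, here is a minimal sketch. The helper name extend_and_train and its epochs default are hypothetical, not part of gensim; it only assumes the model object exposes build_vocab(..., update=True) and train(..., total_examples=..., epochs=...), which is how recent gensim releases spell these calls. Older releases may accept or require different arguments, so check the documentation for your installed version.

```python
def extend_and_train(model, new_sentences, epochs=5):
    """Expand the vocabulary of an existing model, then train on the new text.

    `model` is assumed to be a gensim Word2Vec instance (or anything with
    compatible build_vocab/train methods). The helper name and the default
    of 5 epochs are illustrative choices, not gensim defaults to rely on.
    """
    # Materialize the corpus so it can be counted and iterated more than once.
    new_sentences = list(new_sentences)

    # Second vocabulary scan: update=True adds newly seen words
    # instead of rebuilding the vocabulary from scratch.
    model.build_vocab(new_sentences, update=True)

    # Give train() an accurate size for this batch so that learning-rate
    # decay and progress logging are based on the new data.
    model.train(new_sentences, total_examples=len(new_sentences), epochs=epochs)
    return model
```

With a real model this might be called as extend_and_train(Word2Vec.load('modelbygensim.txt'), new_sentences), where new_sentences is a list of token lists. The caveats above still apply, so training once on the combined corpus remains the safer option.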