Incremental Word2Vec Model Training in gensim

I have tried to incrementally train a word2vec model produced by gensim, but I found that the vocabulary size doesn't increase; only the model weights are updated. I need to update both the vocabulary and the model.

#Load data 
sentences = []
....................

#Training 
model = Word2Vec(sentences, size=100)
model.save("modelbygensim.txt")
model.save_word2vec_format("modelbygensim_text.txt")

#Incremental Training 
model = Word2Vec.load('modelbygensim.txt')
model.train(sentences)
model.save("modelbygensim_incremental.txt")
model.save_word2vec_format("modelbygensim_text_incremental.txt")
asked Mar 12 '17 by Rabindra Nath Nandi

1 Answer

By default, gensim Word2Vec does vocabulary discovery only once. It happens when you supply a corpus (like your sentences) to the constructor, which performs an automatic vocabulary scan and training pass, or alternatively when you call build_vocab(). While you can continue to call train(), no new words will be recognized.

There is support (which I would consider experimental) for calling build_vocab() with new text examples and an update=True parameter to expand the vocabulary. While this lets further train() calls train both old and new words, there are several caveats:

  • such sequential training may not produce models as good, or as self-consistent, as training on all examples interleaved. (For example, continued training may drift words learned from later batches arbitrarily far from words/word-senses in earlier batches that are not re-presented.)
  • such calls to train() should use the optional parameters (such as total_examples or total_words) to give an accurate estimate of the new batch size, so that learning-rate decay and progress logging are done properly
  • the core algorithm and its underlying theory aren't based on such batching, with multiple restarts of the learning rate from high to low, so the interpretation of the results, and the relative strength/balance of the resulting vectors, isn't as well-grounded

If at all possible, combine all your examples into one corpus, and do one large vocabulary-discovery then training.

answered Oct 05 '22 by gojomo