Update gensim word2vec model

I have a word2vec model in gensim trained over 98892 documents. For any given sentence that is not present in the sentences array (i.e. the set on which I trained the model), I need to update the model with that sentence so that querying it the next time returns some results. I am doing it like this:

new_sentence = ['moscow', 'weather', 'cold']
model.train(new_sentence)

and it prints these logs:

2014-03-01 16:46:58,061 : INFO : training model with 1 workers on 98892 vocabulary and 100 features
2014-03-01 16:46:58,211 : INFO : reached the end of input; waiting to finish 1 outstanding jobs
2014-03-01 16:46:58,235 : INFO : training on 10 words took 0.1s, 174 words/s

Now, when I query the most similar words for the same new_sentence (with model.most_similar(positive=new_sentence)), it raises an error:

Traceback (most recent call last):
  File "<pyshell#220>", line 1, in <module>
    model.most_similar(positive=['moscow', 'weather', 'cold'])
  File "/Library/Python/2.7/site-packages/gensim/models/word2vec.py", line 405, in most_similar
    raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word 'cold' not in vocabulary"

This suggests that the word 'cold' is not part of the vocabulary on which I trained the model (am I right?).

So the question is: how do I update the model so that it returns similarities for every word in the given new sentence?

user2480542 asked Mar 01 '14 22:03



2 Answers

  1. train() expects a sequence of sentences on input, not one sentence.

  2. train() only updates weights for existing feature vectors based on existing vocabulary. You cannot add new vocabulary (=new feature vectors) using train().
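To illustrate both points without touching gensim itself, here is a minimal sketch in plain Python. The helper names as_corpus and split_known are hypothetical, and the vocabulary set is a made-up stand-in for the model's vocabulary (in gensim 4.x the real one is exposed as model.wv.key_to_index). As far as I can tell, a flat list such as ['moscow', 'weather', 'cold'] passed as the corpus is iterated as three "sentences", each walked character by character, which is why it should be wrapped in an outer list first.

```python
def as_corpus(tokens_or_sentences):
    """Return a list of sentences, each a list of string tokens.

    A flat list of tokens is treated as one sentence and wrapped;
    anything else is assumed to already be a sequence of sentences.
    """
    items = list(tokens_or_sentences)
    if items and all(isinstance(item, str) for item in items):
        return [items]
    return [list(sentence) for sentence in items]


def split_known(tokens, vocabulary):
    """Split tokens into those present in the vocabulary and those missing."""
    known = [token for token in tokens if token in vocabulary]
    unknown = [token for token in tokens if token not in vocabulary]
    return known, unknown


new_sentence = ['moscow', 'weather', 'cold']

# Point 1: train() wants a sequence of sentences, so wrap the single sentence.
corpus = as_corpus(new_sentence)

# Point 2: words outside the vocabulary have no vector to update or query.
vocabulary = {'moscow', 'weather'}  # hypothetical stand-in, not real model data
known, unknown = split_known(new_sentence, vocabulary)
```

Querying most_similar with only the known tokens avoids the KeyError, but it does not give the missing word a vector.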

Radim answered Sep 20 '22 06:09


As of gensim 0.13.3 it's possible to do online training of Word2Vec with gensim.

model.build_vocab(new_sentences, update=True)
model.train(new_sentences)
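On recent gensim releases, train() no longer accepts the corpus alone: it also needs total_examples and epochs. A rough sketch of the same two-step update follows, assuming gensim 4.x. The helper name update_word2vec, the demo function and its toy corpus are my own illustrations, not part of gensim or of the original answer.

```python
def update_word2vec(model, new_sentences):
    """Add unseen words to the vocabulary, then continue training.

    `model` is expected to behave like a gensim Word2Vec instance.
    """
    new_sentences = [list(sentence) for sentence in new_sentences]
    # update=True extends the existing vocabulary instead of rebuilding it.
    model.build_vocab(new_sentences, update=True)
    # Recent gensim versions require explicit total_examples and epochs.
    model.train(
        new_sentences,
        total_examples=len(new_sentences),
        epochs=model.epochs,
    )
    return model


def demo():
    """Toy run; needs gensim 4.x installed. The corpus is made up."""
    from gensim.models import Word2Vec

    base = [['moscow', 'weather', 'warm'], ['london', 'weather', 'rainy']]
    # min_count=1 so that a word seen only once is kept in the vocabulary.
    model = Word2Vec(sentences=base, vector_size=20, min_count=1, workers=1)
    update_word2vec(model, [['moscow', 'weather', 'cold']])
    return model.wv.most_similar(positive=['moscow', 'weather', 'cold'])
```

Note that min_count still applies during the update: with the default of 5, a new word that occurs only once in new_sentences would be dropped and the KeyError would remain.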
Kamil Sindi answered Sep 21 '22 06:09