Update gensim word2vec model

Tags: gensim, word2vec

I have a word2vec model in gensim trained on 98892 documents. Given a sentence that is not in the sentences array (i.e. the set I trained the model on), I need to update the model with that sentence so that querying it afterwards returns some results. I am doing it like this:

    new_sentence = ['moscow', 'weather', 'cold']
    model.train(new_sentence)

and it prints the following log:

    2014-03-01 16:46:58,061 : INFO : training model with 1 workers on 98892 vocabulary and 100 features
    2014-03-01 16:46:58,211 : INFO : reached the end of input; waiting to finish 1 outstanding jobs
    2014-03-01 16:46:58,235 : INFO : training on 10 words took 0.1s, 174 words/s

Now, when I query the model with the same new_sentence (as model.most_similar(positive=new_sentence)), it raises an error:

    Traceback (most recent call last):
      File "<pyshell#220>", line 1, in <module>
        model.most_similar(positive=['moscow', 'weather', 'cold'])
      File "/Library/Python/2.7/site-packages/gensim/models/word2vec.py", line 405, in most_similar
        raise KeyError("word '%s' not in vocabulary" % word)
    KeyError: "word 'cold' not in vocabulary"

This suggests that the word 'cold' is not part of the vocabulary the model was trained on (am I right?).

So the question is: how do I update the model so that it returns similarities for every word in a given new sentence?

asked Mar 01 '14 22:03

user2480542

2 Answers

1. train() expects a sequence of sentences as input, not a single sentence.
2. train() only updates the weights of existing feature vectors, based on the existing vocabulary. You cannot add new vocabulary (= new feature vectors) using train().
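
To illustrate the first point, here is a minimal pure-Python sketch of how a trainer that loops over sentences, and then over the tokens of each sentence, would see its input. This is not gensim's actual code, and the helper name iter_tokens is hypothetical; it only shows what happens when a single sentence is passed where a sequence of sentences is expected.

```python
def iter_tokens(sentences):
    """Yield tokens the way a sentence-by-sentence trainer would see them.

    Hypothetical helper for illustration; this is not gensim code.
    """
    for sentence in sentences:      # outer loop: one item = one sentence
        for token in sentence:      # inner loop: one item = one token
            yield token


new_sentence = ['moscow', 'weather', 'cold']

# Passing the sentence directly: every word is taken as a "sentence",
# so its individual characters become the "tokens".
seen_when_flat = list(iter_tokens(new_sentence))

# Wrapping it in a list gives a sequence that contains one sentence.
seen_when_wrapped = list(iter_tokens([new_sentence]))

print(seen_when_flat[:6])
print(seen_when_wrapped)
```

So the call in the question would at least need to be model.train([new_sentence]). Per the second point, though, that alone still does not add 'cold' to the vocabulary.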

answered Sep 20 '22 06:09

Radim

As of gensim 0.13.3 it's possible to do online training of Word2Vec with gensim.

    model.build_vocab(new_sentences, update=True)
    model.train(new_sentences)

answered Sep 21 '22 06:09

Kamil Sindi
