
Why Doc2vec gives 2 different vectors for the same texts

I am using Doc2vec to get vectors from words. Please see my code below:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each line of test.txt becomes one TaggedDocument, tagged with its line number.
with open('test.txt', 'r') as f:
    trainings = [TaggedDocument(words=data.strip().split(","), tags=[i])
                 for i, data in enumerate(f)]

model = Doc2Vec(vector_size=5, epochs=55, seed=1, dm_concat=1)

model.build_vocab(trainings)
model.train(trainings, total_examples=model.corpus_count, epochs=model.epochs)

model.save("doc2vec.model")

model = Doc2Vec.load('doc2vec.model')
for i in range(len(model.docvecs)):
    print(i, model.docvecs[i])

My test.txt file has 2 lines, and the contents of those 2 lines are identical (each line is just "a"). I trained a Doc2vec model on it, but the problem is that although the 2 lines are the same, Doc2vec gave me 2 different vectors:

0 [ 0.02730868  0.00393569 -0.08150548 -0.04009786 -0.01400406]
1 [ 0.03916578 -0.06423566 -0.05350181 -0.00726833 -0.08292392]

I don't know why this happened; I thought these vectors would be the same. Can you explain that? And if I want to get the same vector for the same words, what should I do in this case?

asked May 16 '18 by Thanh Bui

People also ask

How is Doc2Vec different from Word2Vec?

Doc2Vec is another widely used technique that creates an embedding of a document irrespective of its length. While Word2Vec computes a feature vector for every word in the corpus, Doc2Vec computes a feature vector for every document in the corpus.
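To make the contrast concrete, here is a minimal sketch with made-up toy documents, using the same gensim API as the question's code: a Doc2Vec model (in its default PV-DM mode) holds per-word vectors just like Word2Vec, plus one vector per tagged document.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy data for illustration only.
docs = [TaggedDocument(words=["the", "cat", "sat"], tags=[0]),
        TaggedDocument(words=["the", "dog", "ran"], tags=[1])]

model = Doc2Vec(docs, vector_size=5, min_count=1, epochs=20, seed=1)

print(model.wv["cat"])     # a per-word vector, as Word2Vec also produces
print(model.docvecs[0])    # a per-document vector, which is what Doc2Vec adds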

Which is better Word2Vec or Doc2Vec?

To sum up, Doc2Vec works much better than the Word2Vec model. But it is worth saying that for document classification we need to somehow transform the word vectors produced by Word2Vec into document vectors. The way I did it in this notebook is not the best. That is likely why we got such a poor result for the Word2Vec model.

What is vector size in Doc2Vec?

The vector maps the document to a point in 100-dimensional space. A size of 200 would map a document to a point in 200-dimensional space. The more dimensions, the more differentiation between documents.
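In gensim terms that size is the vector_size parameter; the values below are arbitrary examples:

from gensim.models.doc2vec import Doc2Vec

model_100 = Doc2Vec(vector_size=100)   # each document maps to a point in 100-dimensional space
model_200 = Doc2Vec(vector_size=200)   # each document maps to a point in 200-dimensional space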

Why do we use Doc2Vec?

The goal is to classify consumer finance complaints into 12 pre-defined classes using Doc2Vec and Logistic Regression. Doc2Vec is an NLP tool for representing documents as vectors and is a generalization of the Word2Vec method. In order to understand Doc2Vec, it is advisable to understand the Word2Vec approach first.


2 Answers

There is inherent randomness in the Doc2Vec (and Word2Vec) algorithms: for example, the initial vectors are already random and differ even for identical sentences. You can comment out the model.train call and see this for yourself.
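For example, a minimal sketch reusing the question's own code, with the training step simply omitted; the printed vectors already differ even though nothing has been trained yet:

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=5, epochs=55, seed=1, dm_concat=1)
model.build_vocab(trainings)      # 'trainings' as built in the question
# model.train(...) deliberately skipped

for i in range(len(model.docvecs)):
    print(i, model.docvecs[i])    # two different random vectors, pre-training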

The details, if you're interested: the vectors are initialized right after the vocabulary is built. In your case that is the model.build_vocab(...) call, which in turn calls the model.reset_doc_weights() method (see the source code at gensim/models/doc2vec.py), and that initializes all vectors randomly, no matter what the sentences are. If at this point you override the initialization and assign equal vectors, they shouldn't diverge anymore.
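If you want to try that, here is a hedged sketch. The array holding the per-document vectors is a gensim internal whose name varies by version (vectors_docs on model.docvecs in gensim 3.x, model.dv.vectors in 4.x), so adjust to your install:

import numpy as np
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=5, epochs=55, seed=1, dm_concat=1)
model.build_vocab(trainings)

# Overwrite the random initialization so both documents start from the same point.
# NOTE: vectors_docs is a gensim 3.x internal; other versions name it differently.
model.docvecs.vectors_docs[1] = np.copy(model.docvecs.vectors_docs[0])

model.train(trainings, total_examples=model.corpus_count, epochs=model.epochs)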

In theory, if you trained on the identical sentences for long enough, the algorithm should converge to the same vector even with different initializations. But in practice that's not going to happen, and I don't think you should be worried about it.

answered Oct 31 '22 by Maxim


@Maxim's answer is correct about the inherent randomness used by the algorithm, but you have additional problems with this example:

  • Doc2Vec doesn't give meaningful results on tiny, toy-sized examples. The vectors only acquire good relative meanings when they're the result of a large, diverse set of contrasting training-contexts. Your 2-line dataset, run through 55 training cycles, is really just providing the model with 1 unique text, 110 times.

  • Even though you've wisely reduced the vector-size to a tiny number (5) to reflect the tiny data, it's still too large a model for just 2 examples, prone to complete 'overfitting'. The model could randomly assign line #1 the vector [1.0, 0.0, 0.0, 0.0, 0.0], and line #2 [0.0, 1.0, 0.0, 0.0, 0.0]. It could then, through all its training, update only its internal weights (never the doc-vectors themselves) and still achieve internal word-predictions as good as or better than in the realistic scenario where everything is incrementally updated, because there's enough free state in the model that there's never any essential competition/compression/tradeoff forcing the two sentences' vectors to converge where they're similar. (There are many possible solutions, and most don't involve any useful generalized 'learning'. Only large datasets, forcing the model into a tug-of-war between modeling many examples as well as possible despite the tradeoffs, create that learning.)

  • dm_concat=1 is a non-default experimental mode that requires even more data to train, and results in larger/slower models. Avoid using it unless you're sure – and can prove with results – that it helps for your use; the sketch after this list sticks with the default mode.
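Putting those points together, here is a sketch with a (still tiny, purely illustrative) more varied corpus, the default PV-DM mode, and a check of how close the two identical documents end up. The texts are made up; with real data you would want far more documents.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = ["a b c d", "a b c d", "e f g h", "c d e f",
         "b c a e", "f g h a", "d e f g", "h a b c"]   # illustrative only
corpus = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]

model = Doc2Vec(vector_size=5, epochs=55, min_count=1, seed=1)   # dm_concat left at default
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Documents 0 and 1 are identical; with enough contrasting data their vectors
# should end up close (high cosine similarity), though not exactly equal.
print(model.docvecs.similarity(0, 1))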

Even when these are fixed, complete determinism isn't automatic in Doc2Vec – and you shouldn't really try to eliminate that. (The small jitter between runs is a useful signal/reminder of the essential variances in this algorithm – and if your training/evaluation remains stable across such small variances, that's an extra indicator that it's functional.)
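That said, if you do need run-to-run reproducibility (say, for a regression test), the gensim documentation notes that a fixed seed alone isn't enough: you must also limit training to a single worker thread and, on Python 3, fix PYTHONHASHSEED in the environment before the interpreter starts. A sketch, reusing the question's 'trainings':

# Run as:  PYTHONHASHSEED=1 python train.py
model = Doc2Vec(vector_size=5, epochs=55, seed=1, workers=1)
model.build_vocab(trainings)
model.train(trainings, total_examples=model.corpus_count, epochs=model.epochs)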

answered Oct 31 '22 by gojomo