
Why Doc2vec gives 2 different vectors for the same texts

I am using Doc2vec to get vectors from words. Please see my code below:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each line of test.txt becomes one TaggedDocument, tagged with its line number.
with open('test.txt', 'r') as f:
    trainings = [TaggedDocument(words=data.strip().split(","), tags=[i])
                 for i, data in enumerate(f)]

model = Doc2Vec(vector_size=5, epochs=55, seed=1, dm_concat=1)

model.build_vocab(trainings)
model.train(trainings, total_examples=model.corpus_count, epochs=model.epochs)

model.save("doc2vec.model")

model = Doc2Vec.load('doc2vec.model')
for i in range(len(model.docvecs)):
    print(i, model.docvecs[i])

My test.txt file has 2 lines, and the contents of those 2 lines are identical (each line is just "a"). I trained a Doc2vec model on it, but the problem is that although the 2 lines are the same, Doc2vec gave me 2 different vectors:

0 [ 0.02730868  0.00393569 -0.08150548 -0.04009786 -0.01400406]
1 [ 0.03916578 -0.06423566 -0.05350181 -0.00726833 -0.08292392]

I don't know why this happened; I thought these vectors would be the same. Can you explain that? And if I want to get the same vector for the same words, what should I do in this case?

asked May 16 '18 by Thanh Bui

People also ask

How is Doc2Vec different from Word2Vec?

Doc2Vec is another widely used technique that creates an embedding of a document irrespective of its length. While Word2Vec computes a feature vector for every word in the corpus, Doc2Vec computes a feature vector for every document in the corpus.
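To make the contrast concrete, here is a minimal sketch with made-up toy documents, using the same gensim API as the question's code: a Doc2Vec model (in its default PV-DM mode) holds per-word vectors just like Word2Vec, plus one vector per tagged document.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy data for illustration only.
docs = [TaggedDocument(words=["the", "cat", "sat"], tags=[0]),
        TaggedDocument(words=["the", "dog", "ran"], tags=[1])]

model = Doc2Vec(docs, vector_size=5, min_count=1, epochs=20, seed=1)

print(model.wv["cat"])     # a per-word vector, as Word2Vec also produces
print(model.docvecs[0])    # a per-document vector, which is what Doc2Vec adds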

Which is better Word2Vec or Doc2Vec?

To sum up, Doc2Vec works much better than the Word2Vec model. But it is worth saying that for document classification we need to somehow transform the word vectors produced by Word2Vec into document vectors. The way I did it in this notebook is not the best. That is likely why we got such a poor result for the Word2Vec model.

What is vector size in Doc2Vec?

The vector maps the document to a point in 100-dimensional space. A size of 200 would map a document to a point in 200-dimensional space. The more dimensions, the more differentiation between documents.
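In gensim terms that size is the vector_size parameter; the values below are arbitrary examples:

from gensim.models.doc2vec import Doc2Vec

model_100 = Doc2Vec(vector_size=100)   # each document maps to a point in 100-dimensional space
model_200 = Doc2Vec(vector_size=200)   # each document maps to a point in 200-dimensional space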

Why do we use Doc2Vec?

The goal is to classify consumer finance complaints into 12 pre-defined classes using Doc2Vec and Logistic Regression. Doc2Vec is an NLP tool for representing documents as vectors and is a generalization of the Word2Vec method. In order to understand Doc2Vec, it is advisable to understand the Word2Vec approach first.


2 Answers

There is inherent randomness in the Doc2Vec (and Word2Vec) algorithms: for example, the initial vectors are already random and differ even for identical sentences. You can comment out the model.train call and see this for yourself.
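For example, a minimal sketch reusing the question's own code, with the training step simply omitted; the printed vectors already differ even though nothing has been trained yet:

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=5, epochs=55, seed=1, dm_concat=1)
model.build_vocab(trainings)      # 'trainings' as built in the question
# model.train(...) deliberately skipped

for i in range(len(model.docvecs)):
    print(i, model.docvecs[i])    # two different random vectors, pre-training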

The details, if you're interested: the vectors are initialized right after the vocabulary is built. In your case that is the model.build_vocab(...) call, which in turn calls the model.reset_doc_weights() method (see the source code at gensim/models/doc2vec.py), and that initializes all vectors randomly, no matter what the sentences are. If at this point you override the initialization and assign equal vectors, they shouldn't diverge anymore.
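If you want to try that, here is a hedged sketch. The array holding the per-document vectors is a gensim internal whose name varies by version (vectors_docs on model.docvecs in gensim 3.x, model.dv.vectors in 4.x), so adjust to your install:

import numpy as np
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=5, epochs=55, seed=1, dm_concat=1)
model.build_vocab(trainings)

# Overwrite the random initialization so both documents start from the same point.
# NOTE: vectors_docs is a gensim 3.x internal; other versions name it differently.
model.docvecs.vectors_docs[1] = np.copy(model.docvecs.vectors_docs[0])

model.train(trainings, total_examples=model.corpus_count, epochs=model.epochs)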

In theory, if you trained on the identical sentences for long enough, the algorithm should converge to the same vector even with different initializations. But in practice that's not going to happen, and I don't think you should be worried about it.

answered Oct 31 '22 by Maxim


@Maxim's answer is correct about the inherent randomness used by the algorithm, but you have additional problems with this example:

  • Doc2Vec doesn't give meaningful results on tiny, toy-sized examples. The vectors only acquire good relative meanings when they're the result of a large, diverse set of contrasting training-contexts. Your 2-line dataset, run through 55 training cycles, is really just providing the model with 1 unique text, 110 times.

  • Even though you've wisely reduced the vector-size to a tiny number (5) to reflect the tiny data, it's still too large a model for just 2 examples, prone to complete 'overfitting'. The model could randomly assign line #1 the vector [1.0, 0.0, 0.0, 0.0, 0.0], and line #2 [0.0, 1.0, 0.0, 0.0, 0.0]. It could then, through all its training, update only its internal weights (never the doc-vectors themselves) and still achieve internal word-predictions as good as or better than in the realistic scenario where everything is incrementally updated, because there's enough free state in the model that there's never any essential competition/compression/tradeoff forcing the two sentences' vectors to converge where they're similar. (There are many possible solutions, and most don't involve any useful generalized 'learning'. Only large datasets, forcing the model into a tug-of-war between modeling many examples as well as possible despite the tradeoffs, create that learning.)

  • dm_concat=1 is a non-default experimental mode that requires even more data to train, and results in larger/slower models. Avoid using it unless you're sure – and can prove with results – that it helps for your use; the sketch after this list sticks with the default mode.
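Putting those points together, here is a sketch with a (still tiny, purely illustrative) more varied corpus, the default PV-DM mode, and a check of how close the two identical documents end up. The texts are made up; with real data you would want far more documents.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = ["a b c d", "a b c d", "e f g h", "c d e f",
         "b c a e", "f g h a", "d e f g", "h a b c"]   # illustrative only
corpus = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]

model = Doc2Vec(vector_size=5, epochs=55, min_count=1, seed=1)   # dm_concat left at default
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Documents 0 and 1 are identical; with enough contrasting data their vectors
# should end up close (high cosine similarity), though not exactly equal.
print(model.docvecs.similarity(0, 1))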

Even when these are fixed, complete determinism isn't automatic in Doc2Vec – and you shouldn't really try to eliminate that. (The small jitter between runs is a useful signal/reminder of the essential variances in this algorithm – and if your training/evaluation remains stable across such small variances, that's an extra indicator that it's functional.)
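That said, if you do need run-to-run reproducibility (say, for a regression test), the gensim documentation notes that a fixed seed alone isn't enough: you must also limit training to a single worker thread and, on Python 3, fix PYTHONHASHSEED in the environment before the interpreter starts. A sketch, reusing the question's 'trainings':

# Run as:  PYTHONHASHSEED=1 python train.py
model = Doc2Vec(vector_size=5, epochs=55, seed=1, workers=1)
model.build_vocab(trainings)
model.train(trainings, total_examples=model.corpus_count, epochs=model.epochs)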

answered Oct 31 '22 by gojomo