I used gensim to fit a doc2vec model, with tagged documents (length > 10) as training data. The goal is to get the doc vectors of all training docs, but only 10 vectors can be found in model.docvecs.
An example of the training data (length > 10):
docs = ['This is a sentence', 'This is another sentence', ....]
With some pre-processing:
doc_ = [d.strip().split(" ") for d in docs]
doc_tagged = []
for i in range(len(doc_)):
tagd = TaggedDocument(doc[i], str(i))
doc_tagged.append(tagd)
The tagged docs look like:
TaggedDocument(words=array(['a', 'b', 'c', ..., ],
dtype='<U32'), tags='117')
Fit a doc2vec model:
model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(doc_tagged)
model.train(doc_tagged, total_examples= model.corpus_count, epochs= model.iter)
Then I inspect the final model:
len(model.docvecs)
The result is 10. I tried other datasets (length > 100, > 1000) and got the same result for len(model.docvecs).
So, my questions are: how can I use model.docvecs to get the full set of vectors (without using model.infer_vector)? And is model.docvecs designed to hold all training doc vectors?
The bug is in this line:
tagd = TaggedDocument(doc[i],str(i))
Gensim's TaggedDocument accepts a sequence of tags as its second argument. When you pass a string like '123', it is treated as a sequence and turned into ['1', '2', '3']. As a result, all of your documents end up tagged with only the 10 tags ['0', ..., '9'], in various combinations.
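You can see this collapse without gensim at all: iterating a string yields its individual characters, so no matter how many documents you tag with str(i), the distinct tags are at most the ten digit characters. A minimal sketch (the helper name is mine, not part of gensim):

```python
# Mimic how a string tag is consumed: gensim iterates over the tags
# argument, and iterating over a string yields its characters.
def chars_seen_as_tags(n_docs):
    distinct = set()
    for i in range(n_docs):
        distinct.update(str(i))  # tag '117' contributes '1' and '7'
    return distinct

print(sorted(chars_seen_as_tags(1000)))  # only the digit characters '0'..'9'
```

This is exactly why len(model.docvecs) came out as 10 for every dataset size you tried.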
Another issue: you define doc_ but never actually use it, so your documents will be split incorrectly as well.
Here's the proper solution:
from gensim.models import doc2vec

docs = [doc.strip().split(' ') for doc in docs]
tagged_docs = [doc2vec.TaggedDocument(doc, [str(i)]) for i, doc in enumerate(docs)]
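With list-wrapped tags, every document keeps its own distinct key, which you can verify before training. A minimal sketch using a stand-in namedtuple (gensim's TaggedDocument is itself a namedtuple with the same two fields, so this checks the tagging logic without needing gensim installed):

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument, which has fields (words, tags).
TaggedDocument = namedtuple('TaggedDocument', ['words', 'tags'])

docs = ['This is a sentence', 'This is another sentence', 'A third one']
docs = [doc.strip().split(' ') for doc in docs]
tagged_docs = [TaggedDocument(doc, [str(i)]) for i, doc in enumerate(docs)]

# Every document now carries exactly one unique tag.
all_tags = [tag for td in tagged_docs for tag in td.tags]
print(len(set(all_tags)) == len(docs))  # True: one distinct tag per doc
```

After training on tags built this way, len(model.docvecs) should equal the number of training documents.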