Doc2vec: Only 10 docvecs in gensim doc2vec model?

I used gensim to fit a doc2vec model, with tagged documents (length > 10) as training data. The goal is to get the doc vectors of all training docs, but only 10 vectors can be found in model.docvecs.

An example of the training data (length > 10):

docs = ['This is a sentence', 'This is another sentence', ....]

with some preprocessing:

doc_ = [d.strip().split(" ") for d in docs]
doc_tagged = []
for i in range(len(docs)):
  tagd = TaggedDocument(docs[i], str(i))
  doc_tagged.append(tagd)

The tagged docs look like this:

TaggedDocument(words=array(['a', 'b', 'c', ..., ],
  dtype='<U32'), tags='117')

Then I fit a doc2vec model:

model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(doc_tagged)
model.train(doc_tagged, total_examples= model.corpus_count, epochs= model.iter)

Then I inspect the final model:

len(model.docvecs)

the result is 10...

I tried other datasets (length > 100, 1000) and got the same result for len(model.docvecs). So, my question is: how can I use model.docvecs to get the vectors of all training docs (without using model.infer_vector)? Is model.docvecs designed to provide all training docvecs?

asked Feb 28 '18 by GemOfRoe


1 Answer

The bug is in this line:

tagd = TaggedDocument(docs[i], str(i))

Gensim's TaggedDocument accepts a sequence of tags as its second argument. When you pass a string like '123', it's treated as a sequence of characters and turned into the three tags ['1', '2', '3']. As a result, all of your documents are tagged with combinations of just 10 distinct tags, ['0', ..., '9'], which is why len(model.docvecs) is 10.
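You can see the splitting directly:

from gensim.models.doc2vec import TaggedDocument

# A string passed as tags is iterated character by character during training
td = TaggedDocument(words=['some', 'words'], tags='117')
print(list(td.tags))  # ['1', '1', '7'] -- three one-character tags, not one tag '117'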

Another issue: you define doc_ but never actually use it, so the raw, unsplit strings are passed in as the words, and your documents end up tokenized incorrectly as well.

Here's the proper solution:

from gensim.models import doc2vec

docs = [doc.strip().split(' ') for doc in docs]
tagged_docs = [doc2vec.TaggedDocument(doc, [str(i)]) for i, doc in enumerate(docs)]
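Each tag is now wrapped in a single-element list, so every document keeps its own distinct tag. As a sanity check, here is a minimal sketch using the same older gensim API as the question (newer gensim releases rename size to vector_size, model.iter to model.epochs, and model.docvecs to model.dv):

model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(tagged_docs)
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.iter)

print(len(model.docvecs))  # equals len(docs): one vector per training document
print(model.docvecs['0'])  # the 100-dimensional vector of the first document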
answered Sep 29 '22 by Maxim