Doc2vec: Only 10 docvecs in gensim doc2vec model?

I used gensim to fit a doc2vec model, with tagged documents (length > 10) as training data. The goal is to get the doc vectors of all training docs, but only 10 vectors can be found in model.docvecs.

An example of the training data (length > 10):

docs = ['This is a sentence', 'This is another sentence', ....]

with some preprocessing:

doc_ = [d.strip().split(" ") for d in docs]
doc_tagged = []
for i in range(len(docs)):
  tagd = TaggedDocument(docs[i], str(i))
  doc_tagged.append(tagd)

The tagged docs look like this:

TaggedDocument(words=array(['a', 'b', 'c', ..., ],
  dtype='<U32'), tags='117')

Then I fit a doc2vec model:

model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(doc_tagged)
model.train(doc_tagged, total_examples= model.corpus_count, epochs= model.iter)

Then I inspect the final model:

len(model.docvecs)

the result is 10...

I tried other datasets (length > 100, 1000) and got the same result for len(model.docvecs). So, my question is: how can I use model.docvecs to get the vectors of all training docs (without using model.infer_vector)? Is model.docvecs designed to provide all training docvecs?

asked Feb 28 '18 by GemOfRoe


1 Answer

The bug is in this line:

tagd = TaggedDocument(docs[i], str(i))

Gensim's TaggedDocument accepts a sequence of tags as its second argument. When you pass a string like '123', it's treated as a sequence of characters and turned into the three tags ['1', '2', '3']. As a result, all of your documents are tagged with combinations of just 10 distinct tags, ['0', ..., '9'], which is why len(model.docvecs) is 10.
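You can see the splitting directly:

from gensim.models.doc2vec import TaggedDocument

# A string passed as tags is iterated character by character during training
td = TaggedDocument(words=['some', 'words'], tags='117')
print(list(td.tags))  # ['1', '1', '7'] -- three one-character tags, not one tag '117'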

Another issue: you define doc_ but never actually use it, so the raw, unsplit strings are passed in as the words, and your documents end up tokenized incorrectly as well.

Here's the proper solution:

from gensim.models import doc2vec

docs = [doc.strip().split(' ') for doc in docs]
tagged_docs = [doc2vec.TaggedDocument(doc, [str(i)]) for i, doc in enumerate(docs)]
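Each tag is now wrapped in a single-element list, so every document keeps its own distinct tag. As a sanity check, here is a minimal sketch using the same older gensim API as the question (newer gensim releases rename size to vector_size, model.iter to model.epochs, and model.docvecs to model.dv):

model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(tagged_docs)
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.iter)

print(len(model.docvecs))  # equals len(docs): one vector per training document
print(model.docvecs['0'])  # the 100-dimensional vector of the first document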
answered Sep 29 '22 by Maxim