I'm using the Doc2Vec tags as a unique identifier for my documents: each document has a different tag, and the tags carry no semantic meaning. I use the tags to find specific documents so I can calculate the similarity between them.
Do the tags influence the results of my model?
In this tutorial they mention a parameter train_lbls=False; with it set to False, no representations are learned for the labels (tags).
That tutorial is somewhat dated and I suspect the parameter no longer exists. How does Doc2Vec handle tags?
Doc2Vec is another widely used technique that creates an embedding of a document irrespective of its length. While Word2Vec computes a feature vector for every word in the corpus, Doc2Vec computes a feature vector for every document in the corpus.
DBOW is the doc2vec model analogous to the Skip-gram model in word2vec. The paragraph vectors are obtained by training a neural network on the task of predicting words randomly sampled from the paragraph, given only the paragraph vector.
For gensim's Doc2Vec, your text examples must be objects similar to the example TaggedDocument class: with words and tags properties. The tags property should be a list of 'tags', which serve as keys to the doc-vectors that will be learned from the corresponding text.
In the classic/original case, each document has a single tag – essentially a unique ID for that one document. (Tags can be strings, but for very large corpora, Doc2Vec will use somewhat less memory if you instead use tags that are plain Python ints, starting from 0, with no skipped values.)
The tags are used to look up the learned vectors after training. If you had a document during training with the single tag 'mars', you'd look up the learned vector with model.docvecs['mars'].
If you were to do a model.docvecs.most_similar('mars') call, the results will be reported by their tag keys as well.
The tags are just keys into the doc-vectors collection – they have no semantic meaning, and even if a string is repeated from the word-tokens in the text, there's no necessary relation between this tag key and the word.
That is, if you have a document whose single ID tag is 'mars', there's no essential relationship between the learned doc-vector accessed via that key (model.docvecs['mars']) and any learned word-vector accessed with the same string key (model.wv['mars']) – they come from separate collections of vectors.