 

Does Doc2Vec learn representations for the tags?

Tags:

gensim

doc2vec

I'm using Doc2Vec tags as unique identifiers for my documents: each document has a distinct tag that carries no semantic meaning. I use the tags to look up specific documents so I can calculate the similarity between them.

Do the tags influence the results of my model?

In this tutorial they talk about a parameter train_lbls=false; with it set to false, no representations are learned for the labels (tags).

That tutorial is somewhat dated and I suspect the parameter no longer exists. How does Doc2Vec handle tags?

asked Apr 21 '17 by Stanko


People also ask

What are tags in Doc2Vec?

The tags are just keys into the doc-vectors collection – they have no semantic meaning, and even if a string is repeated from the word-tokens in the text, there's no necessary relation between this tag key and the word.

How is Doc2Vec different from Word2Vec?

Doc2Vec is another widely used technique that creates an embedding of a document irrespective to its length. While Word2Vec computes a feature vector for every word in the corpus, Doc2Vec computes a feature vector for every document in the corpus.
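A hedged sketch of that difference, assuming gensim 4.x (where the doc-vectors live under model.dv; older releases call it model.docvecs) and a toy two-document corpus with illustrative parameters:

from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [['red', 'planet', 'orbit'], ['blue', 'ocean', 'tide']]

# Word2Vec learns one vector per word in the corpus...
w2v = Word2Vec(texts, vector_size=50, min_count=1, epochs=20)
print(w2v.wv['planet'].shape)   # (50,)

# ...while Doc2Vec learns one vector per document, whatever its length.
docs = [TaggedDocument(words=t, tags=[i]) for i, t in enumerate(texts)]
d2v = Doc2Vec(docs, vector_size=50, min_count=1, epochs=20)
print(d2v.dv[0].shape)          # (50,)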

How is Doc2Vec trained?

DBOW is the doc2vec model analogous to Skip-gram model in word2vec. The paragraph vectors are obtained by training a neural network on the task of predicting a probability distribution of words in a paragraph given a randomly-sampled word from the paragraph.
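In gensim, the dm parameter selects the mode: a minimal sketch with an illustrative corpus and parameters:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=['red', 'planet', 'orbit'], tags=[0]),
    TaggedDocument(words=['blue', 'ocean', 'tide'], tags=[1]),
]

# dm=0 selects DBOW: the doc-vector alone is trained to predict
# words sampled from its paragraph. dm=1 (the default) selects PV-DM.
model = Doc2Vec(docs, dm=0, vector_size=50, min_count=1, epochs=20)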


1 Answer

For gensim's Doc2Vec, your text examples must be objects similar to the provided TaggedDocument class, with words and tags properties. The tags property should be a list of 'tags', which serve as keys to the doc-vectors that will be learned from the corresponding text.

In the classic/original case, each document has a single tag – essentially a unique ID for that one document. (Tags can be strings, but for very large corpora, Doc2Vec will use somewhat less memory if you instead use tags that are plain Python ints, starting from 0, with no skipped values.)
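For example (a hedged sketch; the document texts are illustrative):

from gensim.models.doc2vec import TaggedDocument

# Plain-int tags, starting from 0 with no gaps, let gensim keep the
# doc-vectors in a plain array without an extra string-to-index lookup.
doc0 = TaggedDocument(words=['red', 'planet', 'orbit'], tags=[0])
doc1 = TaggedDocument(words=['blue', 'ocean', 'tide'], tags=[1])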

The tags are used to look up the learned vectors after training. If you had a document during training with the single tag 'mars', you'd look up the learned vector with:

model.docvecs['mars']

If you were to do a model.docvecs.most_similar('mars') call, the results will be reported by their tag keys as well.
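For instance (a sketch against a trained model; actual neighbors depend on your data, and in gensim 4.x model.docvecs is named model.dv):

# Each result is a (tag, cosine-similarity) pair, keyed by whatever
# tags were supplied at training time.
for tag, sim in model.docvecs.most_similar('mars', topn=3):
    print(tag, sim)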

The tags are just keys into the doc-vectors collection – they have no semantic meaning, and even if a string is repeated from the word-tokens in the text, there's no necessary relation between this tag key and the word.

That is, if you have a document whose single ID tag is 'mars', there's no essential relationship between the learned doc-vector accessed via that key (model.docvecs['mars']), and any learned word-vector accessed with the same string key (model.wv['mars']) – they're coming from separate collections-of-vectors.
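A quick way to see that they are distinct collections (a hedged sketch; assumes the training texts contain the word 'mars' and some document used the tag 'mars'):

import numpy as np

doc_vec = model.docvecs['mars']   # from the doc-vectors collection
word_vec = model.wv['mars']       # from the word-vectors collection

# Same string key, but independently trained vectors:
print(np.allclose(doc_vec, word_vec))   # almost certainly False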

answered Oct 23 '22 by gojomo