 

Doc2Vec: Differentiate Sentence and Document


I am just playing around with Doc2Vec from gensim, analyzing a Stack Exchange dump to assess the semantic similarity of questions and identify duplicates.

The Doc2Vec-Tutorial seems to describe the input as tagged sentences.

But the original paper (Doc2Vec-Paper) claims that the method can be used to infer fixed-length vectors for paragraphs/documents.

Can someone explain the difference between a sentence and a document in this context, and how I would go about inferring paragraph vectors?

Since a question can sometimes span multiple sentences, I thought that during training I would give sentences arising from the same question the same tags. But then how would I use infer_vector on unseen questions?

And this notebook (Doc2Vec-Notebook) seems to train vectors on both the TRAIN and TEST docs. Can someone explain the rationale behind this, and should I do the same?

Vikash Balasubramanian asked Feb 15 '17

People also ask

What is the difference between Word2Vec and Doc2Vec?

While Word2Vec computes a feature vector for every word in the corpus, Doc2Vec computes a feature vector for every document in the corpus. The Doc2Vec model is based on Word2Vec, adding only one extra vector (the paragraph ID) to the input.

What is Doc2Vec model?

Doc2Vec is an NLP tool for representing documents as vectors, and is a generalization of the Word2Vec method. To understand Doc2Vec, it helps to first understand the Word2Vec approach.

What is Doc2Vec paragraph ID?

Note: the paragraph ID is a unique document ID. As with Word2Vec, two flavors of Doc2Vec are available: the Distributed Memory Model of Paragraph Vectors (PV-DM) and the Distributed Bag of Words version of Paragraph Vectors (PV-DBOW).


1 Answer

Gensim's Doc2Vec expects you to provide text examples of the same object-shape as the example TaggedDocument class: having both a words and a tags property.

The words are an ordered sequence of string tokens of the text – they might be a single sentence's worth, a paragraph, or a long document; it's up to you.

The tags are a list of tags to be learned from the text – such as plain ints, or string-tokens, that somehow serve to name the corresponding texts. In the original 'Paragraph Vectors' paper, they were just unique IDs for each text – such as integers monotonically increasing from 0. (So the first TaggedDocument might have a tags of just [0], the next [1], etc.)
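The object-shape requirement above can be sketched without gensim at all – anything with a `words` list and a `tags` list works. The question texts and the integer tag scheme below are illustrative, not from the original post:

```python
from collections import namedtuple

# Same shape as gensim's TaggedDocument: a `words` list and a `tags` list.
TaggedDocument = namedtuple('TaggedDocument', ['words', 'tags'])

raw_questions = [
    "How do I merge two dicts?",
    "What does yield do in Python?",
]

# One tag per text: integers monotonically increasing from 0.
corpus = [
    TaggedDocument(words=text.lower().split(), tags=[i])
    for i, text in enumerate(raw_questions)
]
```

Here `corpus[0].tags` is `[0]`, `corpus[1].tags` is `[1]`, and so on, matching the paper's unique-ID scheme.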

The algorithm just works on chunks of text, without any idea of what a sentence/paragraph/document etc might be. (Just consider them all 'documents' for the purpose of Doc2Vec, with you deciding what's the right kind of 'document' from your corpus.) It's even common for the tokenization to retain punctuation, such as the periods between sentences, as standalone tokens.
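One simple way to tokenize while keeping punctuation as standalone tokens – a regex sketch, not gensim's own tokenizer:

```python
import re

text = "Is this a duplicate? I think so."
# \w+ grabs word runs; [^\w\s] grabs each punctuation mark as its own token.
tokens = re.findall(r"\w+|[^\w\s]", text.lower())
# ['is', 'this', 'a', 'duplicate', '?', 'i', 'think', 'so', '.']
```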

Inference occurs via the infer_vector() method, which takes a mandatory parameter doc_words, which should be a list-of-string-tokens just like those that were supplied as text words during training.

You don't supply any tags on inferred text: Doc2Vec just gives you back a raw vector that, within the relationships learned by the model, fits the text well. (That is: the vector is good at predicting the text's words, in the same way that the vectors and internal model weights learned during bulk training were good at predicting the training texts' words.)

Note that many have found better results from inference by increasing the optional steps parameter (and possibly decreasing the inference starting alpha to be more like the bulk-training starting alpha, 0.025 to 0.05).

The doc2vec-IMDB demo notebook tries to reproduce one of the experiments from the original Paragraph Vectors paper, so it's following what's described there, and a demo script that one of the authors (Mikolov) once released. Since 'test' documents (without their target-labels/known-sentiments) may still be available at training time to help improve the text-modelling, it can be reasonable to include their raw texts during the unsupervised Doc2Vec training. (Their known labels are not used when training the classifier which uses the doc-vectors.)

(Note that at the moment, February 2017, the doc2vec-IMDB demo notebook is a little out-of-date compared to the current gensim Doc2Vec defaults & best-practices – in particular the models aren't given the right explicit iter=1 value to make the later manual loop-and-train() do just the right number of training passes.)

gojomo answered Sep 21 '22