Doc2Vec: Differentiate Sentence and Document

I am experimenting with Doc2Vec from gensim, analysing a Stack Exchange dump to measure the semantic similarity of questions and identify duplicates.

The tutorial on Doc2Vec-Tutorial seems to describe the input as tagged sentences.

But the original paper (Doc2Vec-Paper) claims that the method can be used to infer fixed-length vectors of paragraphs/documents.

Can someone explain the difference between a sentence and a document in this context, and how I would go about inferring paragraph vectors?

Since a question can sometimes span multiple sentences, I thought that during training I would give sentences arising from the same question the same tag. But then how would I use infer_vector on unseen questions?

And this notebook, Doc2Vec-Notebook, seems to train vectors on both the TRAIN and TEST docs. Can someone explain the rationale behind this, and should I do the same?

asked Feb 15 '17 06:02

Vikash Balasubramanian

1 Answer

Gensim's Doc2Vec expects you to provide text examples of the same object-shape as the example TaggedDocument class: having both a words and a tags property.

The words are an ordered sequence of string tokens of the text – they might be a single sentence worth, or a paragraph, or a long document, it's up to you.

The tags are a list of tags to be learned from the text – such as plain ints, or string-tokens, that somehow serve to name the corresponding texts. In the original 'Paragraph Vectors' paper, they were just unique IDs for each text – such as integers monotonically increasing from 0. (So the first TaggedDocument might have a tags of just [0], the next [1], etc.)

The algorithm just works on chunks of text, without any idea of what a sentence/paragraph/document etc might be. (Just consider them all 'documents' for the purpose of Doc2Vec, with you deciding what's the right kind of 'document' from your corpus.) It's even common for the tokenization to retain punctuation, such as the periods between sentences, as standalone tokens.
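As a minimal sketch of the point above, a multi-sentence question can be turned into a single 'document' with one unique tag, keeping punctuation as standalone tokens. The tokenize helper and the sample questions are made up for illustration, and the namedtuple is only a stand-in with the same words/tags shape so the sketch runs without gensim; in real use you would import TaggedDocument from gensim.models.doc2vec instead.

```python
import re
from collections import namedtuple

# Stand-in with the same shape as gensim's TaggedDocument (words + tags),
# used here only so the sketch runs without gensim installed.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def tokenize(text):
    """Hypothetical tokenizer: lowercase, keep punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


# Made-up questions; the first one spans two sentences but stays one document.
questions = [
    "How do I merge two dicts? I tried update() but it mutates the first one.",
    "What is the fastest way to merge dictionaries in Python?",
]

# One TaggedDocument per question, tagged with a unique integer ID.
corpus = [
    TaggedDocument(words=tokenize(q), tags=[i]) for i, q in enumerate(questions)
]

for doc in corpus:
    print(doc.tags, doc.words)
```

With this layout there is no need to split a question into sentences that share a tag: the whole question is the unit that gets a vector.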

Inference occurs via the infer_vector() method, which takes a mandatory parameter doc_words, which should be a list-of-string-tokens just like those that were supplied as text words during training.

You don't supply any tags on inferred text: Doc2Vec just gives you back a raw vector that, within the relationships learned by the model, fits the text well. (That is: the vector is good at predicting the text's words, in the same way that the vectors and internal model weights learned during bulk training were good at predicting the training texts' words.)

Note that many have found better results from inference by increasing the optional steps parameter (and possibly decreasing the inference starting alpha to be more like the bulk-training starting alpha, 0.025 to 0.05).

The doc2vec-IMDB demo notebook tries to reproduce one of the experiments from the original Paragraph Vectors paper, so it's following what's described there, and a demo script that one of the authors (Mikolov) once released. Since 'test' documents (without their target-labels/known-sentiments) may still be available, at training time, to help improve the text-modelling, it can be reasonable to include their raw texts during the unsupervised Doc2Vec training. (Their known-labels are not used when training the classifier which uses the doc-vectors.)
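The data flow described here can be sketched as follows: raw texts from both splits feed the unsupervised step, while only the training labels reach the classifier. The toy reviews and variable names are hypothetical, and the namedtuple is a stand-in for gensim's TaggedDocument so the sketch runs on its own; this is not the notebook's actual code.

```python
from collections import namedtuple

# Stand-in with the same words/tags shape as gensim's TaggedDocument.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

# Hypothetical toy data: (tokens, sentiment_label).
train = [("a great film".split(), 1), ("a dull plot".split(), 0)]
test = [("great acting".split(), 1)]  # label exists but is held out

# Unsupervised step: texts from BOTH splits, each with a unique tag, no labels.
all_texts = [words for words, _ in train + test]
doc2vec_corpus = [TaggedDocument(words, [i]) for i, words in enumerate(all_texts)]

# Supervised step: only training tags are paired with their labels.
classifier_examples = [(i, label) for i, (_, label) in enumerate(train)]

# Test tags are kept aside; their labels are used only for evaluation.
heldout_tags = list(range(len(train), len(train) + len(test)))

print(len(doc2vec_corpus), classifier_examples, heldout_tags)
```

If the test texts are not available at training time, the alternative is to train on the training texts only and call infer_vector() on each new text.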

(Note that at the moment, February 2017, the doc2vec-IMDB demo notebook is a little out-of-date compared to the current gensim Doc2Vec defaults & best-practices. In particular, the models aren't given the right explicit iter=1 value to make the later manual loop-and-train() do just the right number of training passes.)

answered Sep 21 '22 09:09

gojomo
