How does gensim calculate doc2vec paragraph vectors?

I am going through this paper: http://cs.stanford.edu/~quocle/paragraph_vector.pdf

and it states that

"The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. In the experiments, we use concatenation as the method to combine the vectors."

How does concatenation or averaging work?

Example (if paragraph 1 contains word1 and word2):

word1 vector = [0.1, 0.2, 0.3]
word2 vector = [0.4, 0.5, 0.6]

Concatenation method:
does paragraph vector = [0.1+0.4, 0.2+0.5, 0.3+0.6] ?

Average method:
does paragraph vector = [(0.1+0.4)/2, (0.2+0.5)/2, (0.3+0.6)/2] ?

Also from this image:

It is stated that:

The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the current context – or the topic of the paragraph. For this reason, we often call this model the Distributed Memory Model of Paragraph Vectors (PV-DM).

Is the paragraph token equal to the paragraph vector, which is equal to "on"?


jxn asked Nov 04 '16 01:11




1 Answer

How does concatenation or averaging work?

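You got the arithmetic right for the average: the vectors are averaged element-wise. Concatenation instead joins the vectors end to end: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]. Note that, per the passage you quote, the result is the combined input used to predict the next word; the paragraph vector is one of the vectors being combined, not the result of combining them.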
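To make the difference concrete, here is a minimal sketch in plain Python using the two example vectors from the question. The helper names `average` and `concatenate` are made up for illustration and are not part of gensim.

```python
# Example vectors taken from the question.
word1 = [0.1, 0.2, 0.3]
word2 = [0.4, 0.5, 0.6]


def average(vectors):
    """Element-wise mean: the result has the same length as each input."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def concatenate(vectors):
    """Join the vectors end to end: the result length is the sum of input lengths."""
    return [x for vector in vectors for x in vector]


averaged = average([word1, word2])
concatenated = concatenate([word1, word2])

print(averaged)       # 3 components, approximately [0.25, 0.35, 0.45]
print(concatenated)   # [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
```

Averaging keeps the dimensionality at 3, while concatenation grows it to 6, so the layer that consumes the combined vector has to be sized accordingly.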

Is the paragraph token equal to the paragraph vector, which is equal to "on"?

The "paragraph token" is mapped to a vector that is called "paragraph vector". It is different from the token "on", and different from the word vector that the token "on" is mapped to.
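As a rough illustration of that distinction, the sketch below builds the PV-DM input for one training example. Everything here is an assumption for illustration: the lookup tables, the token name `PARA_1`, the context words "the", "cat", "sat" (assumed to match the figure in the paper), and all numeric values are invented, and this is not gensim's actual implementation.

```python
# Hypothetical lookup tables; all values are invented for illustration.
# The paragraph token has its own entry, separate from every word vector.
paragraph_vectors = {"PARA_1": [0.9, 0.8, 0.7]}
word_vectors = {
    "the": [0.1, 0.2, 0.3],
    "cat": [0.4, 0.5, 0.6],
    "sat": [0.7, 0.8, 0.9],
    "on": [0.2, 0.4, 0.6],  # the word to be predicted, not part of the input
}


def pv_dm_input(paragraph_id, context_words, mode="concat"):
    """Combine the paragraph vector with the context word vectors."""
    vectors = [paragraph_vectors[paragraph_id]]
    vectors += [word_vectors[word] for word in context_words]
    if mode == "concat":
        return [x for vector in vectors for x in vector]
    if mode == "mean":
        return [sum(components) / len(vectors) for components in zip(*vectors)]
    raise ValueError("mode must be 'concat' or 'mean'")


concat_input = pv_dm_input("PARA_1", ["the", "cat", "sat"], mode="concat")
mean_input = pv_dm_input("PARA_1", ["the", "cat", "sat"], mode="mean")

print(len(concat_input))  # 12: one paragraph vector plus three word vectors
print(len(mean_input))    # 3: same size as a single vector
```

The paragraph vector is simply the first of the vectors being combined, and the vector for "on" never enters the input; it is only the prediction target. In gensim, the choice between the two combination modes is exposed through the `dm_mean` and `dm_concat` parameters of `Doc2Vec`.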

Franck Dernoncourt answered Oct 27 '22 22:10
