Doc2vec: How to get document vectors

People also ask

What is Doc2Vec?

Doc2vec is an NLP tool for representing documents as vectors and is a generalization of the word2vec method.

Does Doc2Vec use Word2Vec?

The Doc2vec model is based on Word2Vec; it only adds one more vector (the paragraph ID) to the input.
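
To make the "extra vector" idea concrete, here is a toy sketch in plain Python. It assumes the variant in which the paragraph vector is averaged with the context word vectors before predicting the next word (summing or concatenating are other options). The function name pvdm_input and all the numbers are made up for illustration; this is not gensim code.

```python
def pvdm_input(paragraph_vector, context_word_vectors):
    """Average the paragraph vector together with the context word vectors."""
    vectors = [paragraph_vector] + list(context_word_vectors)
    dim = len(paragraph_vector)
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

# Toy 3-dimensional vectors (made-up numbers, not trained values)
word_vectors = {"this": [0.0, 3.0, 0.0], "is": [3.0, 0.0, 0.0]}
paragraph_vectors = {0: [0.0, 0.0, 3.0]}  # one vector per document tag

hidden = pvdm_input(paragraph_vectors[0], [word_vectors["this"], word_vectors["is"]])
print(hidden)  # [1.0, 1.0, 1.0]
```

Without the paragraph vector this would be the word2vec CBOW input; the paragraph vector is the only addition, and it is what you read back later as the document vector.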

If you want to train a Doc2Vec model, your data set needs to contain lists of words (similar to the Word2Vec format) and tags (the IDs of the documents). It can also contain some additional info (see https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb for more information).

# Import libraries

from gensim.models import doc2vec
from collections import namedtuple

# Load data

doc1 = ["This is a sentence", "This is another sentence"]

# Transform data (you can add more data preprocessing steps) 

docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(doc1):
    words = text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))

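# Train model (set min_count = 1 if you want the model to work with the provided example data set)
# Note: the vector dimension parameter is vector_size; the older name size was removed in gensim 4.0

model = doc2vec.Doc2Vec(docs, vector_size = 100, window = 300, min_count = 1, workers = 4)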

# Get the vectors (in gensim 4.x, model.dv is the preferred name for model.docvecs)

model.docvecs[0]
model.docvecs[1]

UPDATE (how to train in epochs): This example became outdated, so I deleted it. For more information on training in epochs, see this answer or @gojomo's comment.

Gensim was updated. LabeledSentence no longer takes labels; it now takes tags. See the documentation for LabeledSentence: https://radimrehurek.com/gensim/models/doc2vec.html

However, @bee2502 was right with

docvec = model.docvecs[99]

It will show the 100th vector's value for the trained model. It works with both integer and string tags.

doc1=["This is a sentence","This is another sentence"]
documents=[doc.strip().split(" ") for doc in doc1 ]
model = doc2vec.Doc2Vec(documents, size = 100, window = 300, min_count = 10, workers=4)

I got AttributeError: 'list' object has no attribute 'words' because the input documents passed to Doc2Vec() were not in the correct LabeledSentence format. I hope the example below helps you understand the format.

documents = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])
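
To see why a bare list of tokens triggers that AttributeError, here is a small stand-alone check. It assumes, as the error message suggests, that each document must expose a words attribute and a tags attribute (called labels in older releases). TaggedDoc and missing_fields are hypothetical names used only for this illustration; they are not part of gensim.

```python
from collections import namedtuple

# Hypothetical stand-in with the two fields the model is assumed to read
TaggedDoc = namedtuple("TaggedDoc", "words tags")

def missing_fields(doc):
    """Return the names of the required attributes that `doc` lacks."""
    return [name for name in ("words", "tags") if not hasattr(doc, name)]

plain = "This is a sentence".split()            # a bare list of tokens
tagged = TaggedDoc(words=plain, tags=["SENT_1"])

print(missing_fields(plain))   # ['words', 'tags']
print(missing_fields(tagged))  # []
```

Running a check like this over your corpus before training is a cheap way to catch the format problem early.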

More details are here: http://rare-technologies.com/doc2vec-tutorial/. However, I solved the problem by reading the input data from a file using TaggedLineDocument().
File format: one document = one line = one TaggedDocument object. Words are expected to be already preprocessed and separated by whitespace, tags are constructed automatically from the document line number.
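
The file format described above can be sketched in plain Python as follows. This only mimics the described behaviour (one line per document, whitespace-separated words, the line number as the tag); read_tagged_lines and TaggedDoc are hypothetical names, and this is not gensim's implementation.

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical stand-in for a tagged document
TaggedDoc = namedtuple("TaggedDoc", "words tags")

def read_tagged_lines(path):
    """Yield one document per line, tagged with its zero-based line number."""
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle):
            yield TaggedDoc(words=line.split(), tags=[line_no])

# Write a tiny two-line corpus to a temporary file for the demonstration
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("this is a sentence\nthis is another sentence\n")

docs = list(read_tagged_lines(tmp.name))
os.remove(tmp.name)

print(docs[1].words)  # ['this', 'is', 'another', 'sentence']
print(docs[1].tags)   # [1]
```

Because the tags are line numbers, the vector for the document on line n of the file (counting from 0) is looked up with the integer n.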

sentences=doc2vec.TaggedLineDocument(file_path)
# vector_size replaces the older size parameter, which was removed in gensim 4.0
model = doc2vec.Doc2Vec(sentences, vector_size = 100, window = 300, min_count = 10, workers=4)

To get a document vector, you can use docvecs. More details here: https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.TaggedDocument

docvec = model.docvecs[99]

where 99 is the ID of the document whose vector we want. If the labels are integers (the default when you load with TaggedLineDocument()), use the integer ID directly, as I did. If the labels are strings, use "SENT_99". This is similar to Word2vec.


Tags: python, gensim, word2vec
