I have two directories of text files that I want to read and label, but I don't know how to do this via TaggedDocument. I thought it would work as TaggedDocument([Strings], [Labels]), but apparently it doesn't.
This is my code:
from gensim import models
from gensim.models.doc2vec import TaggedDocument
import utilities as util
import os
from sklearn import svm
from nltk.tokenize import sent_tokenize
CogPath = "./FixedCog/"
NotCogPath = "./FixedNotCog/"
SamplePath ="./Sample/"
docs = []
tags = []
CogList = [p for p in os.listdir(CogPath) if p.endswith('.txt')]
NotCogList = [p for p in os.listdir(NotCogPath) if p.endswith('.txt')]
SampleList = [p for p in os.listdir(SamplePath) if p.endswith('.txt')]
for doc in CogList:
    str = open(CogPath+doc,'r').read().decode("utf-8")
    docs.append(str)
    print docs
    tags.append(doc)
    print "###########"
    print tags
    print "!!!!!!!!!!!"
for doc in NotCogList:
    str = open(NotCogPath+doc,'r').read().decode("utf-8")
    docs.append(str)
    tags.append(doc)
for doc in SampleList:
    str = open(SamplePath + doc, 'r').read().decode("utf-8")
    docs.append(str)
    tags.append(doc)
T = TaggedDocument(docs,tags)
model = models.Doc2Vec(T, alpha=.025, min_alpha=.025, min_count=1, size=50)
and this is the error I get:
Traceback (most recent call last):
  File "/home/farhood/PycharmProjects/word2vec_prj/doc2vec.py", line 34, in <module>
    model = models.Doc2Vec(T,alpha=.025, min_alpha=.025, min_count=1,size=50)
  File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 635, in __init__
    self.build_vocab(documents, trim_rule=trim_rule)
  File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 544, in build_vocab
    self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
  File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 674, in scan_vocab
    if isinstance(document.words, string_types):
AttributeError: 'list' object has no attribute 'words'
The error happens because Doc2Vec expects an iterable of TaggedDocument objects, one per document, not a single TaggedDocument wrapping all documents at once. As the gensim documentation for TaggedDocument puts it: it represents a single document, made up of words (a list of unicode string tokens) and tags (a list of tokens), and is the input document format for Doc2Vec.
Doc2vec is an NLP tool for representing documents as vectors and is a generalization of the word2vec method; to understand doc2vec, it helps to first understand the word2vec approach.
The input for a Doc2Vec model should be a list of TaggedDocument(['list','of','words'], [TAG_001]). A good practice is to use the index of each sentence as its tag. For example, to train a Doc2Vec model with two sentences (i.e. documents, paragraphs):
import gensim
from gensim.models.doc2vec import TaggedDocument

s1 = 'the quick fox brown fox jumps over the lazy dog'
s1_tag = '001'
s2 = 'i want to burn a zero-day'
s2_tag = '002'

docs = []
docs.append(TaggedDocument(words=s1.split(), tags=[s1_tag]))
docs.append(TaggedDocument(words=s2.split(), tags=[s2_tag]))

model = gensim.models.Doc2Vec(vector_size=300, window=5, min_count=5, workers=4, epochs=20)
model.build_vocab(docs)
print 'Start training process...'
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
# save the trained model to disk
model.save(model_path)  # model_path: a file path of your choice, e.g. 'my_doc2vec.model'
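Once the model is trained, you can look up the learned vector for a tag or infer a vector for unseen text. A minimal sketch, assuming the same gensim 3.x-style API as above (the example sentence is made up for illustration):

vec_s1 = model.docvecs['001']  # learned vector for the first document's tag
new_vec = model.infer_vector('another quick fox'.split())  # vector for new, unseen tokens
print vec_s1.shape, new_vec.shape  # both 300-dimensional with vector_size=300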
So I just experimented a bit and found this on github:
class TaggedDocument(namedtuple('TaggedDocument', 'words tags')):
    """
    A single document, made up of `words` (a list of unicode string tokens)
    and `tags` (a list of tokens). Tags may be one or more unicode string
    tokens, but typical practice (which will also be most memory-efficient) is
    for the tags list to include a unique integer id as the only tag.

    Replaces "sentence as a list of words" from Word2Vec.
    """
So I decided to change how I use TaggedDocument: build one TaggedDocument instance per document. The important thing is that you have to pass the tags as a list.
for doc in CogList:
    text = open(CogPath+doc, 'r').read().decode("utf-8")  # renamed from `str` to avoid shadowing the built-in
    word_list = text.split()
    T = TaggedDocument(word_list, [doc])  # tags must be a list; here the filename serves as the tag
    docs.append(T)
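For completeness, a sketch of how the corrected loop extends to the other two directories and feeds the model, reusing the question's paths, lists, and hyperparameters (same Python 2 / older-gensim style as the question, where the vector size parameter is still called size):

docs = []
for path, file_list in [(CogPath, CogList), (NotCogPath, NotCogList), (SamplePath, SampleList)]:
    for doc in file_list:
        text = open(path + doc, 'r').read().decode("utf-8")
        docs.append(TaggedDocument(text.split(), [doc]))  # one TaggedDocument per file
# Doc2Vec now receives an iterable of TaggedDocument objects, which is what scan_vocab expects
model = models.Doc2Vec(docs, alpha=.025, min_alpha=.025, min_count=1, size=50)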