I use gensim to build a dictionary from a collection of documents, where each document is a list of tokens. This is my code:
from gensim import corpora, models, similarities

def constructModel(self, docTokens):
    """Given document tokens, constructs the tf-idf and similarity models."""
    # Construct the dictionary for the BOW (vector-space) model:
    # a Dictionary maps words to integer ids, i.e. it is a collection
    # of (word_id, word_string) pairs.
    self.dictionary = corpora.Dictionary(docTokens)

    # Optionally prune the dictionary: remove words that appear too
    # infrequently or too frequently.
    print("dictionary size before filter_extremes:", self.dictionary)
    # self.dictionary.filter_extremes(no_below=1, no_above=0.9, keep_n=100000)
    # self.dictionary.compactify()
    print("dictionary size after filter_extremes:", self.dictionary)

    # Construct the corpus BOW vectors; a BOW vector is a collection
    # of (word_id, word_frequency) pairs.
    corpus_bow = [self.dictionary.doc2bow(doc) for doc in docTokens]

    # Construct the tf-idf model and transform each raw BOW vector
    # in the corpus into the tf-idf vector space.
    self.model = models.TfidfModel(corpus_bow, normalize=True)
    corpus_tfidf = self.model[corpus_bow]

    # Construct the term-document similarity index.
    self.similarityModel = similarities.MatrixSimilarity(corpus_tfidf)
My question is: how do I add a new document (a list of tokens) to this dictionary and update it? I searched the gensim documentation but didn't find a solution.
There is documentation for how to do this on the gensim webpage here.
The way to do it is to create another dictionary with the new documents and then merge them:
from gensim import corpora

dict1 = corpora.Dictionary(firstDocs)
dict2 = corpora.Dictionary(moreDocs)
dict1.merge_with(dict2)  # updates dict1 in place
According to the docs, this will map "same tokens to the same ids and new tokens to new ids".
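To make that concrete, here is a minimal sketch with made-up token lists (firstDocs and moreDocs are placeholders): tokens already present in dict1 keep their ids, new tokens get fresh ids, and merge_with also returns a transformer that re-maps bag-of-words vectors built against dict2 into the merged id space.

from gensim import corpora

# Hypothetical toy documents for illustration.
firstDocs = [["human", "interface", "computer"],
             ["survey", "user", "computer", "system"]]
moreDocs = [["graph", "trees", "computer"]]

dict1 = corpora.Dictionary(firstDocs)
dict2 = corpora.Dictionary(moreDocs)

# merge_with updates dict1 in place: "computer" keeps its old id,
# while "graph" and "trees" receive fresh ids.
transformer = dict1.merge_with(dict2)
print(dict1.token2id)

# The returned transformer converts vectors expressed in dict2's ids
# into dict1's merged id space.
print(transformer[dict2.doc2bow(["graph", "computer"])])

Note that if you only need to grow a single dictionary and don't care about the transformer, Dictionary.add_documents(moreDocs) extends it in place without building a second dictionary.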
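Keep in mind that enlarging the dictionary does not update the tf-idf model or the similarity index from the question, since both were computed from the old corpus. A minimal sketch of keeping them in sync, assuming the same attribute names as constructModel and that the tokenized documents were stored on self (the docTokens attribute and addDocuments helper are hypothetical):

from gensim import models, similarities

def addDocuments(self, newDocTokens):
    """Hypothetical helper: grow the dictionary and rebuild the models."""
    # Extend the existing dictionary in place with the new documents.
    self.dictionary.add_documents(newDocTokens)
    # Assumes the original tokenized documents were kept on self.docTokens.
    self.docTokens.extend(newDocTokens)
    corpus_bow = [self.dictionary.doc2bow(doc) for doc in self.docTokens]
    # Re-fit tf-idf and the similarity index over the enlarged corpus.
    self.model = models.TfidfModel(corpus_bow, normalize=True)
    self.similarityModel = similarities.MatrixSimilarity(self.model[corpus_bow])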