Does gensim.corpora.Dictionary have term frequency saved?

From gensim.corpora.Dictionary, it's possible to get the document frequency of a word (i.e. how many documents a particular word occurs in):

    from nltk.corpus import brown
    from gensim.corpora import Dictionary

    # Each sentence of the Brown corpus is treated as one document
    documents = brown.sents()
    brown_dict = Dictionary(documents)

    # The word with id 100 in the dictionary: 'these'
    print('The word "' + brown_dict[100] + '" appears in', brown_dict.dfs[100], 'documents')

[out]:

    The word "these" appears in 1213 documents
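
To make "document frequency" concrete, here is a minimal pure-Python sketch of how such a count can be computed on a toy corpus. It is not gensim's implementation; `toy_docs` and `document_frequency` are hypothetical names used only for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
toy_docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked"],
    ["a", "cat", "and", "a", "dog"],
]

def document_frequency(docs):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        # set() makes a token count once per document, however often it repeats
        df.update(set(doc))
    return df

df = document_frequency(toy_docs)
print(df["the"])  # 2: "the" occurs three times overall, but in only two documents
```

The repeated "the" in the first document does not raise its count, which is exactly the difference from term frequency.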

And there is the filter_n_most_frequent(remove_n) function, which removes the n most frequent tokens:

filter_n_most_frequent(remove_n)

Filter out the ‘remove_n’ most frequent tokens that appear in the documents.

After the pruning, shrink resulting gaps in word ids.

Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!

Does the filter_n_most_frequent function pick the n most frequent tokens based on document frequency or on term frequency?

If it's the latter, is there some way to access the term frequency of the words in the gensim.corpora.Dictionary object?

asked Oct 11 '17 09:10 by alvas



1 Answer

No, gensim.corpora.Dictionary does not save term frequency. You can see the source code here. The class only stores the following member variables:

    self.token2id = {}  # token -> tokenId
    self.id2token = {}  # reverse mapping for token2id; only formed on request, to save memory
    self.dfs = {}  # document frequencies: tokenId -> in how many documents this token appeared

    self.num_docs = 0  # number of documents processed
    self.num_pos = 0  # total number of corpus positions
    self.num_nnz = 0  # total number of non-zeroes in the BOW matrix

This means that wherever the class refers to frequency, it means document frequency, never term frequency, because the latter is never stored globally. This applies to filter_n_most_frequent(remove_n) as well as every other method.
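
If term frequency is needed, one option is to count it yourself from the tokenized documents. The sketch below is a pure-Python illustration, not part of gensim; `toy_docs` and `top_n` are hypothetical names. It also shows that the "most frequent" token can differ depending on which of the two frequencies is used.

```python
from collections import Counter
from itertools import chain

# Hypothetical toy corpus: each document is a list of tokens.
toy_docs = [
    ["a", "a", "a", "a", "b"],
    ["b", "c"],
    ["b", "d"],
]

# Term frequency: every occurrence of a token counts.
term_freq = Counter(chain.from_iterable(toy_docs))

# Document frequency: a token counts at most once per document.
doc_freq = Counter(chain.from_iterable(set(doc) for doc in toy_docs))

def top_n(freq, n):
    """Return the n tokens with the highest counts, ties broken alphabetically."""
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:n]]

print(top_n(term_freq, 1))  # ['a'] (4 occurrences)
print(top_n(doc_freq, 1))   # ['b'] (appears in 3 documents)
```

With a real corpus, `toy_docs` would be replaced by the tokenized documents, for example `brown.sents()` from the question.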

answered Oct 25 '22 12:10 by ubadub