Does gensim.corpora.Dictionary have term frequency saved?
From gensim.corpora.Dictionary, it's possible to get the document frequency of the words (i.e. how many documents a particular word occurred in):
from nltk.corpus import brown
from gensim.corpora import Dictionary
documents = brown.sents()
brown_dict = Dictionary(documents)
# The 100th word in the dictionary: 'these'
print('The word "' + brown_dict[100] + '" appears in', brown_dict.dfs[100],'documents')
[out]:
The word "these" appears in 1213 documents
And there is the filter_n_most_frequent(remove_n) function that can remove the n most frequent tokens:
filter_n_most_frequent(remove_n)
Filter out the ‘remove_n’ most frequent tokens that appear in the documents. After the pruning, shrink resulting gaps in word ids.
Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!
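For context, a minimal usage sketch (continuing the example above, not part of the quoted documentation) looks like this:
# Remove the 2 most frequent tokens; word ids are compacted afterwards,
# so ids recorded before this call may no longer point to the same words.
brown_dict.filter_n_most_frequent(2)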
Is the filter_n_most_frequent function removing the n most frequent tokens based on document frequency or term frequency?
If it's the latter, is there some way to access the term frequency of the words in the gensim.corpora.Dictionary object?
No, gensim.corpora.Dictionary does not save term frequency. You can see the source code here. The class only stores the following member variables:
self.token2id = {} # token -> tokenId
self.id2token = {} # reverse mapping for token2id; only formed on request, to save memory
self.dfs = {} # document frequencies: tokenId -> in how many documents this token appeared
self.num_docs = 0 # number of documents processed
self.num_pos = 0 # total number of corpus positions
self.num_nnz = 0 # total number of non-zeroes in the BOW matrix
This means every method in the class, including filter_n_most_frequent(remove_n), works with document frequency, never term frequency, because term frequency is never stored globally.
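If you need global term frequencies anyway, one possible workaround (a minimal sketch, not a gensim built-in; it reuses the Brown corpus setup from the question) is to sum the per-document counts that doc2bow returns:
from collections import Counter
from nltk.corpus import brown
from gensim.corpora import Dictionary
documents = brown.sents()
brown_dict = Dictionary(documents)
# Aggregate the per-document (tokenId, count) pairs from doc2bow into
# global term frequencies: tokenId -> total occurrences across the corpus.
term_freqs = Counter()
for doc in documents:
    term_freqs.update(dict(brown_dict.doc2bow(doc)))
print('The word "' + brown_dict[100] + '" occurs', term_freqs[100], 'times in the corpus')
Because the counts are keyed by the same tokenId mapping, term_freqs[100] and brown_dict.dfs[100] refer to the same word.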