I am using gensim for some NLP task. I've created a corpus from dictionary.doc2bow, where dictionary is an object of corpora.Dictionary. Now I want to filter out the terms with low tf-idf values before running an LDA model. I looked into the documentation of the corpus class but cannot find a way to access the terms. Any ideas? Thank you.
TF-IDF stands for term frequency-inverse document frequency and it is a measure, used in the fields of information retrieval (IR) and machine learning, that can quantify the importance or relevance of string representations (words, phrases, lemmas, etc) in a document amongst a collection of documents (also known as a ...
The TF-IDF of a term is calculated by multiplying its TF and IDF scores. In plain English, the importance of a term is high when it occurs a lot in a given document and rarely in others. In short, commonality within a document, measured by TF, is balanced by rarity between documents, measured by IDF.
Can TF-IDF be negative? No. The lowest value is 0, because both term frequency and inverse document frequency are non-negative numbers.
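To make the calculation concrete, here is a minimal sketch in plain Python. It assumes raw counts for TF and log2(N / df) for IDF, which is one common variant; libraries differ in details such as smoothing and normalization. The toy documents and the helper name tf_idf are made up for illustration.

```python
import math

# Toy corpus (hypothetical): each document is a list of tokens.
docs = [
    ["cat", "sat", "mat"],
    ["cat", "cat", "dog"],
    ["cat", "bird"],
]

def tf_idf(term, doc, docs):
    """Raw count of `term` in `doc` times log2(N / df) over `docs`."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0
    idf = math.log2(len(docs) / df)
    return tf * idf

# "cat" occurs in every document, so its IDF (and its TF-IDF) is 0.
cat_score = tf_idf("cat", docs[1], docs)
# "dog" occurs in only one of the three documents, so it scores higher.
dog_score = tf_idf("dog", docs[1], docs)
print(cat_score, dog_score)
```

Because df can never exceed N, the logarithm is never negative, which is why the score bottoms out at 0 under this definition.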
Say your corpus is the following:
corpus = [dictionary.doc2bow(doc) for doc in documents]
After running TFIDF you can retrieve a list of low value words:
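from gensim.models import TfidfModel

tfidf = TfidfModel(corpus, id2word=dictionary)

low_value = 0.2
low_value_words = []
for bow in corpus:
    # collect ids of terms whose tf-idf weight in this document is below the threshold
    low_value_words += [term_id for term_id, value in tfidf[bow] if value < low_value]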
Then filter them out of the dictionary before running LDA:
dictionary.filter_tokens(bad_ids=low_value_words)
Recompute the corpus now that low value words are filtered out:
new_corpus = [dictionary.doc2bow(doc) for doc in documents]
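Note that the loop above collects a term as soon as its weight is low in any single document, and filter_tokens then removes it from every document. If you would rather drop only the terms that are low everywhere, one possible sketch is below. It works on plain lists of (term_id, weight) pairs, which is the shape tfidf[bow] yields; the function name ids_low_everywhere and the sample data are hypothetical.

```python
def ids_low_everywhere(weighted_docs, threshold):
    """Return ids whose highest weight across all documents is below threshold."""
    best = {}
    for doc in weighted_docs:
        for term_id, weight in doc:
            best[term_id] = max(weight, best.get(term_id, 0.0))
    return sorted(t for t, w in best.items() if w < threshold)

# Hypothetical tf-idf output for two documents.
weighted_docs = [
    [(0, 0.10), (1, 0.90)],
    [(0, 0.15), (1, 0.05), (2, 0.70)],
]
bad_ids = ids_low_everywhere(weighted_docs, threshold=0.2)
print(bad_ids)
```

With gensim you would pass (tfidf[bow] for bow in corpus) as weighted_docs and hand the result to dictionary.filter_tokens(bad_ids=...), as in the answer.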
This is old, but if you want to look at it on a per-document level, do something like this:
from gensim import corpora, models

# same as before
dictionary = corpora.Dictionary(doc_list)
corpus = [dictionary.doc2bow(doc) for doc in doc_list]
tfidf = models.TfidfModel(corpus, id2word=dictionary)

# filter low value words
low_value = 0.025
for i in range(len(corpus)):
    bow = corpus[i]
    # ids of terms whose tf-idf weight in this document is below the threshold
    low_value_words = {term_id for term_id, value in tfidf[bow] if value < low_value}
    new_bow = [b for b in bow if b[0] not in low_value_words]
    # reassign
    corpus[i] = new_bow
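One caveat, as far as I can tell: gensim's TfidfModel leaves terms whose weight is (near) zero out of tfidf[bow] altogether, for example a word that occurs in every document. Such terms never make it into low_value_words, so they survive the filter. A sketch of a variant that keeps only the ids known to be at or above the threshold is below. The function name keep_high_value and the sample data are hypothetical, and it takes plain (id, value) lists so it can be tried without gensim.

```python
def keep_high_value(bow, weighted, threshold):
    """Keep (term_id, count) pairs whose tf-idf weight is >= threshold.

    Ids that are missing from `weighted` are dropped as well.
    """
    keep = {term_id for term_id, weight in weighted if weight >= threshold}
    return [(term_id, count) for term_id, count in bow if term_id in keep]

# Hypothetical document: id 0 is absent from the tf-idf output (weight ~0).
bow = [(0, 3), (1, 1), (2, 2)]
weighted = [(1, 0.01), (2, 0.80)]
new_bow = keep_high_value(bow, weighted, threshold=0.025)
print(new_bow)
```

Inside the loop above, this would replace the two list comprehensions with corpus[i] = keep_high_value(bow, tfidf[bow], low_value).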