
How to filter out words with low tf-idf in a corpus with gensim?

Tags: python, nlp, gensim

I am using gensim for an NLP task. I've created a corpus with dictionary.doc2bow, where dictionary is a corpora.Dictionary object. Now I want to filter out the terms with low tf-idf values before running an LDA model. I looked through the documentation of the corpus class but could not find a way to access the terms. Any ideas? Thank you.

ziyuang asked Jul 10 '14 23:07


People also ask

What is TF-IDF, and how does it help information retrieval from a large corpus?

TF-IDF stands for term frequency-inverse document frequency and it is a measure, used in the fields of information retrieval (IR) and machine learning, that can quantify the importance or relevance of string representations (words, phrases, lemmas, etc) in a document amongst a collection of documents (also known as a ...

How do I find my TF-IDF value?

The TF-IDF of a term is calculated by multiplying TF and IDF scores. Translated into plain English, importance of a term is high when it occurs a lot in a given document and rarely in others. In short, commonality within a document measured by TF is balanced by rarity between documents measured by IDF.
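As a rough illustration of the multiplication described above, here is a minimal pure-Python sketch. The toy documents and the helper name tf_idf are made up for this example, and it uses the plain textbook form tf * ln(N / df). Libraries such as gensim apply their own weighting and normalization, so their numbers will differ.

```python
import math

# Toy tokenized documents, invented for illustration only.
docs = [
    ["cat", "sat", "mat"],
    ["cat", "ate", "fish"],
    ["cat", "ate", "fish", "fish"],
]


def tf_idf(term, doc, docs):
    """Textbook tf-idf: relative term frequency times ln(N / document frequency)."""
    tf = doc.count(term) / len(doc)              # how common the term is in this document
    df = sum(1 for d in docs if term in d)       # number of documents containing the term
    if df == 0:
        return 0.0                               # term never occurs anywhere
    idf = math.log(len(docs) / df)               # rarity across the collection
    return tf * idf


cat_score = tf_idf("cat", docs[0], docs)    # "cat" is in every document, so idf is 0
sat_score = tf_idf("sat", docs[0], docs)    # "sat" occurs in one document only
fish_score = tf_idf("fish", docs[2], docs)  # frequent in its document, present in two of three
```

In this sketch a term that appears in every document scores exactly 0, while a term confined to a single document scores highest, which is the balance between TF and IDF that the paragraph describes.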

Can TF-IDF values be negative?

No. The lowest value is 0. With the usual definitions, both term frequency and inverse document frequency are non-negative.


2 Answers

Say your corpus is the following:

corpus = [dictionary.doc2bow(doc) for doc in documents]

After fitting a TF-IDF model, you can collect the ids of the low-value words:

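from gensim.models import TfidfModel

tfidf = TfidfModel(corpus, id2word=dictionary)

low_value = 0.2  # tf-idf threshold; tune it for your data
low_value_words = set()  # ids of terms scoring below the threshold in at least one document
for bow in corpus:
    low_value_words.update(term_id for term_id, value in tfidf[bow] if value < low_value)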

Then filter them out of the dictionary before running LDA:

dictionary.filter_tokens(bad_ids=low_value_words)

Recompute the corpus now that the low-value words are filtered out (filter_tokens reassigns token ids, so the old corpus no longer matches the dictionary):

new_corpus = [dictionary.doc2bow(doc) for doc in documents]
interpolack answered Oct 03 '22 08:10


This is old, but if you want to filter on a per-document level, do something like this:

from gensim import corpora, models

# same as before
dictionary = corpora.Dictionary(doc_list)
corpus = [dictionary.doc2bow(doc) for doc in doc_list]
tfidf = models.TfidfModel(corpus, id2word=dictionary)

# filter low-value words, one document at a time
low_value = 0.025

for i, bow in enumerate(corpus):
    # ids of terms whose tf-idf in this document is below the threshold
    low_value_words = {term_id for term_id, value in tfidf[bow] if value < low_value}
    # reassign the document, keeping only the (id, count) pairs that are not flagged
    corpus[i] = [(term_id, count) for term_id, count in bow if term_id not in low_value_words]
Bryan Goggin answered Oct 03 '22 08:10