I am trying to find the most important words in a corpus based on their TF-IDF scores.
Been following along the example at https://radimrehurek.com/gensim/tut2.html. Based on
>>> for doc in corpus_tfidf:
... print(doc)
the TF-IDF score is getting updated in each iteration. For example,
So here's how I am currently getting the final TF-IDF score for each word,
tfidf = gensim.models.tfidfmodel.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
d = {}
for doc in corpus_tfidf:
for id, value in doc:
word = dictionary.get(id)
d[word] = value
Is there a better way?
Thanks in advance.
As its name implies, TF-IDF vectorizes/scores a word by multiplying the word's Term Frequency (TF) with the Inverse Document Frequency (IDF). Term Frequency: TF of a term or word is the number of times the term appears in a document compared to the total number of words in the document.
TF-IDF stands for term frequency-inverse document frequency and it is a measure, used in the fields of information retrieval (IR) and machine learning, that can quantify the importance or relevance of string representations (words, phrases, lemmas, etc) in a document amongst a collection of documents (also known as a ...
The formula that is used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False ), where n is the total number of documents in the document set and df(t) is the document frequency of t; the ...
How about using dictionary comprehensions?
d = {dictionary.get(id): value for doc in corpus_tfidf for id, value in doc}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With