First let's extract the TF-IDF scores per term per document:
from gensim import corpora, models, similarities
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
# Remove stopwords and tokenize each document
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
# Map tokens to integer ids and build the bag-of-words corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
# Fit the TF-IDF model on the bag-of-words corpus and transform it
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
Printing it out:
for doc in corpus_tfidf:
    print(doc)
[out]:
[(0, 0.4301019571350565), (1, 0.4301019571350565), (2, 0.4301019571350565), (3, 0.4301019571350565), (4, 0.2944198962221451), (5, 0.2944198962221451), (6, 0.2944198962221451)]
[(4, 0.3726494271826947), (7, 0.27219160459794917), (8, 0.3726494271826947), (9, 0.27219160459794917), (10, 0.3726494271826947), (11, 0.5443832091958983), (12, 0.3726494271826947)]
[(6, 0.438482464916089), (7, 0.32027755044706185), (9, 0.32027755044706185), (13, 0.6405551008941237), (14, 0.438482464916089)]
[(5, 0.3449874408519962), (7, 0.5039733231394895), (14, 0.3449874408519962), (15, 0.5039733231394895), (16, 0.5039733231394895)]
[(9, 0.21953536176370683), (10, 0.30055933182961736), (12, 0.30055933182961736), (17, 0.43907072352741366), (18, 0.43907072352741366), (19, 0.43907072352741366), (20, 0.43907072352741366)]
[(21, 0.48507125007266594), (22, 0.48507125007266594), (23, 0.48507125007266594), (24, 0.48507125007266594), (25, 0.24253562503633297)]
[(25, 0.31622776601683794), (26, 0.31622776601683794), (27, 0.6324555320336759), (28, 0.6324555320336759)]
[(25, 0.20466057569885868), (26, 0.20466057569885868), (29, 0.2801947048062438), (30, 0.40932115139771735), (31, 0.40932115139771735), (32, 0.40932115139771735), (33, 0.40932115139771735), (34, 0.40932115139771735)]
[(8, 0.6282580468670046), (26, 0.45889394536615247), (29, 0.6282580468670046)]
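To see which word each integer id refers to, the dictionary's token-to-id mapping can be printed (added here for readability, not part of the original snippet):
# token2id maps each token string to the integer id gensim assigned to it
print(dictionary.token2id)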
If we want to find the "saliency" or "importance" of the words within this corpus, can we simply sum the TF-IDF scores across all documents and divide by the number of documents? I.e.
>>> from collections import Counter
>>> tfidf_saliency = Counter()
>>> for doc in corpus_tfidf:
...     for word, score in doc:
...         tfidf_saliency[word] += score / len(corpus_tfidf)
...
>>> tfidf_saliency
Counter({7: 0.12182694202050007, 8: 0.11121194156107769, 26: 0.10886469856464989, 29: 0.10093919463036093, 9: 0.09022272408985754, 14: 0.08705221175200946, 25: 0.08482488519466996, 6: 0.08143359568202602, 10: 0.07480097322359022, 12: 0.07480097322359022, 4: 0.07411881371164887, 13: 0.07117278898823597, 5: 0.07104525967490458, 27: 0.07027283689263066, 28: 0.07027283689263066, 11: 0.060487023243988705, 15: 0.055997035904387725, 16: 0.055997035904387725, 21: 0.05389680556362955, 22: 0.05389680556362955, 23: 0.05389680556362955, 24: 0.05389680556362955, 17: 0.048785635947490406, 18: 0.048785635947490406, 19: 0.048785635947490406, 20: 0.048785635947490406, 0: 0.04778910634833961, 1: 0.04778910634833961, 2: 0.04778910634833961, 3: 0.04778910634833961, 30: 0.045480127933079706, 31: 0.045480127933079706, 32: 0.045480127933079706, 33: 0.045480127933079706, 34: 0.045480127933079706})
Looking at the output, could we assume that the most "prominent" words in the corpus are:
>>> dictionary[7]
u'system'
>>> dictionary[8]
u'survey'
>>> dictionary[26]
u'graph'
If so, what is the mathematical interpretation of the sum of TF-IDF scores of words across documents?
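In symbols, the loop above computes the per-term mean of TF-IDF over all documents:

$$\mathrm{saliency}(t) = \frac{1}{N} \sum_{d=1}^{N} \mathrm{tfidf}(t, d)$$

where $N$ is len(corpus_tfidf) and $\mathrm{tfidf}(t, d) = 0$ whenever $t$ does not occur in $d$.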
Run a TF-IDF report for your words and get their weights. The higher the numerical weight, the rarer the term; the smaller the weight, the more common the term. You can then compare the terms with the highest TF-IDF weights against their search volumes on the web.
Since TF-IDF weights words by relevance, the words with the highest scores can be taken as the most important. This can be used to summarize articles more efficiently, or simply to pick keywords (or even tags) for a document.
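For example, here is a minimal keyword-extraction sketch built from the objects defined in the question (the top-2 cutoff is an arbitrary choice):
# For each document, keep the two terms with the highest TF-IDF score
for doc_id, doc in enumerate(corpus_tfidf):
    top = sorted(doc, key=lambda pair: pair[1], reverse=True)[:2]
    print(doc_id, [dictionary[word_id] for word_id, score in top])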
Multiplying a term's frequency (TF) by its inverse document frequency (IDF) gives the TF-IDF score of a word in a document. The higher the score, the more relevant that word is in that particular document.
Notice that the IDF score is higher if the term appears in fewer documents, but that the range of visible IDF scores is between 1 and 6.
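In symbols, a common formulation is

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log_2 \frac{N}{\mathrm{df}(t)}$$

where $N$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing $t$. (Gensim's TfidfModel uses a base-2 logarithm by default and then L2-normalizes each document vector, which is why the scores printed earlier fall between 0 and 1; exact values differ across implementations.)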
One interpretation of a term's TF-IDF for the whole corpus is the highest TF-IDF score that term reaches in any single document.
Find the Top Words in corpus_tfidf.
topWords = {}
# For every word id, keep the highest TF-IDF score it reaches in any document
for doc in corpus_tfidf:
    for iWord, tf_idf in doc:
        if iWord not in topWords:
            topWords[iWord] = 0
        if tf_idf > topWords[iWord]:
            topWords[iWord] = tf_idf

# Print the six words with the highest peak TF-IDF
for i, item in enumerate(sorted(topWords.items(), key=lambda x: x[1], reverse=True), 1):
    print("%2s: %-13s %s" % (i, dictionary[item[0]], item[1]))
    if i == 6:
        break
Output comparison chart:
NOTE: Couldn't use gensim to create a matching dictionary with corpus_tfidf, so the chart can only display word indices.
   tfidf_saliency (question)     topWords(corpus_tfidf)      Other TF-IDF implementation
------------------------------------------------------------------------------------------
1: Word(7)   0.121             1: Word(13)  0.640           1: paths         0.376019
2: Word(8)   0.111             2: Word(27)  0.632           2: intersection  0.376019
3: Word(26)  0.108             3: Word(28)  0.632           3: survey        0.366204
4: Word(29)  0.100             4: Word(8)   0.628           4: minors        0.366204
5: Word(9)   0.090             5: Word(29)  0.628           5: binary        0.300815
6: Word(14)  0.087             6: Word(11)  0.544           6: generation    0.300815
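If the dictionary from the question is available, the indices in the first two columns can be mapped back to words, e.g. for the tfidf_saliency column (a minimal sketch):
# Show the six most salient terms as words rather than ids
for word_id, score in tfidf_saliency.most_common(6):
    print("%-13s %.3f" % (dictionary[word_id], score))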
The calculation of TF-IDF always takes the whole corpus into account.
Tested with Python 3.4.2.