I'm trying to use Textacy to calculate the TF-IDF score for a single word across the standard corpus, but am a bit unclear about the result I am receiving.
I was expecting a single float which represented the frequency of the word in the corpus. So why am I receiving a list (?) of 7 results?
"acculer" is actually a French word, so was expecting a result of 0 from an English corpus.
word = 'acculer'
vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=True, idf_type='smooth')
tf_idf = vectorizer.fit_transform(word)
logger.info("tf_idf:")
logger.info(tf_idf)
Output
tf_idf:
(0, 0) 2.386294361119891
(1, 1) 1.9808292530117262
(2, 1) 1.9808292530117262
(3, 5) 2.386294361119891
(4, 3) 2.386294361119891
(5, 2) 2.386294361119891
(6, 4) 2.386294361119891
The second part of the question is: how can I provide my own corpus to the TF-IDF function in Textacy, especially one in a different language?
EDIT
As mentioned by @Vishal, I have logged the output using this line:
logger.info(vectorizer.vocabulary_terms)
It seems the provided word acculer has been split into characters:
{'a': 0, 'c': 1, 'u': 5, 'l': 3, 'e': 2, 'r': 4}
(1) How can I get the TF-IDF for this word against the corpus, rather than each character?
(2) How can I provide my own corpus and point to it as a param?
(3) Can TF-IDF be used at a sentence level? ie: what is the relative frequency of this sentence's terms against the corpus.
Let's get the definitions clear before looking into the actual questions.
Assume our corpus contains 3 documents (d1, d2 and d3 respectively):
corpus = ["this is a red apple", "this is a green apple", "this is a cat"]
tf (term frequency) of a word is defined as the number of times the word appears in a document.
tf(word, document) = count(word, document) # Number of times word appears in the document
tf is defined for a word at the document level.
tf('a',d1) = 1 tf('a',d2) = 1 tf('a',d3) = 1
tf('apple',d1) = 1 tf('apple',d2) = 1 tf('apple',d3) = 0
tf('cat',d1) = 0 tf('cat',d2) = 0 tf('cat',d3) = 1
tf('green',d1) = 0 tf('green',d2) = 1 tf('green',d3) = 0
tf('is',d1) = 1 tf('is',d2) = 1 tf('is',d3) = 1
tf('red',d1) = 1 tf('red',d2) = 0 tf('red',d3) = 0
tf('this',d1) = 1 tf('this',d2) = 1 tf('this',d3) = 1
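As a quick sanity check, these raw counts can be reproduced with plain Python (a minimal sketch using collections.Counter, independent of textacy):

from collections import Counter

corpus = ["this is a red apple", "this is a green apple", "this is a cat"]

# Raw term frequency: how often each word occurs in each document
tf = [Counter(doc.split()) for doc in corpus]

print(tf[0]['apple'])   # 1 -> tf('apple', d1)
print(tf[2]['cat'])     # 1 -> tf('cat', d3)
print(tf[0]['cat'])     # 0 -> tf('cat', d1)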
Using the raw counts has a problem: the tf values of words in longer documents tend to be higher than those in shorter documents. This can be solved by normalizing the raw counts, dividing by the document length (the number of words in the corresponding document); this is called l1 normalization. The document d1 can then be represented by its tf vector, holding the tf values of all the words in the vocabulary of the corpus. There is another kind of normalization, called l2, which makes the l2 norm of the document's tf vector equal to 1.
tf(word, document, normalize='l1') = count(word, document)/|document|
tf(word, document, normalize='l2') = count(word, document)/l2_norm(document)
|d1| = 5, |d2| = 5, |d3| = 4
l2_norm(d1) = 2.236, l2_norm(d2) = 2.236, l2_norm(d3) = 2.0
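These lengths and norms can be verified by hand; here is a minimal numpy sketch (assuming plain whitespace tokenization of the same corpus):

import numpy as np
from collections import Counter

corpus = ["this is a red apple", "this is a green apple", "this is a cat"]
vocab = sorted({w for doc in corpus for w in doc.split()})

# One row per document, one column per vocabulary word, holding raw counts
counts = np.array([[Counter(doc.split())[w] for w in vocab] for doc in corpus], dtype=float)

print(counts.sum(axis=1))              # document lengths: [5. 5. 4.]
print(np.linalg.norm(counts, axis=1))  # l2 norms: [2.236 2.236 2.]

tf_l1 = counts / counts.sum(axis=1, keepdims=True)              # l1-normalized tf
tf_l2 = counts / np.linalg.norm(counts, axis=1, keepdims=True)  # l2-normalized tf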
Code : tf
corpus = ["this is a red apple", "this is a green apple", "this is a cat"]
# Convert docs to textacy format
textacy_docs = [textacy.Doc(doc) for doc in corpus]
for norm in [None, 'l1', 'l2']:
# tokenize the documents
tokenized_docs = [
doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
for doc in textacy_docs]
# Fit the tf matrix
vectorizer = textacy.Vectorizer(apply_idf=False, norm=norm)
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
print ("\nVocabulary: ", vectorizer.vocabulary_terms)
print ("TF with {0} normalize".format(norm))
print (doc_term_matrix.toarray())
Output:
Vocabulary: {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
TF with None normalize
[[1 1 0 0 1 1 1]
[1 1 0 1 1 0 1]
[1 0 1 0 1 0 1]]
Vocabulary: {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
TF with l1 normalize
[[0.2 0.2 0. 0. 0.2 0.2 0.2 ]
[0.2 0.2 0. 0.2 0.2 0. 0.2 ]
[0.25 0. 0.25 0. 0.25 0. 0.25]]
Vocabulary: {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
TF with l2 normalize
[[0.4472136 0.4472136 0. 0. 0.4472136 0.4472136 0.4472136]
[0.4472136 0.4472136 0. 0.4472136 0.4472136 0. 0.4472136]
[0.5 0. 0.5 0. 0.5 0. 0.5 ]]
The rows in the tf matrix correspond to documents (hence 3 rows for our corpus) and the columns correspond to the words in the vocabulary (the index of each word is shown in the vocabulary dictionary).
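To read a single value out of that matrix, combine a document's row index with the word's column index from vocabulary_terms. A small sketch, reusing the vectorizer and doc_term_matrix from the last loop iteration above (the l2-normalized one):

# tf of 'apple' in the second document (d2): row 1, column taken from the vocabulary
col = vectorizer.vocabulary_terms['apple']
print(doc_term_matrix.toarray()[1, col])   # 0.447... for the l2-normalized matrix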
Some words convey less information than others. For example, words like the, a, an, this, that are very common and convey very little information. idf is a measure of the importance of a word: a word appearing in many documents is considered less informative than a word appearing in only a few documents.
idf(word, corpus) = log(|corpus| / number of documents containing word) + 1 # standard idf
For our corpus intuitively idf(apple, corpus) < idf(cat,corpus)
idf('apple', corpus) = log(3/2) + 1 = 1.405
idf('cat', corpus) = log(3/1) + 1 = 2.098
idf('this', corpus) = log(3/3) + 1 = 1.0
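These values follow directly from the formula above (natural log); a quick check in plain Python:

import math

n_docs = 3
doc_freq = {'apple': 2, 'cat': 1, 'this': 3}  # number of documents containing each word

for word, df in doc_freq.items():
    print(word, math.log(n_docs / df) + 1)
# matches the values above (up to rounding)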
Code : idf
textacy_docs = [textacy.Doc(doc) for doc in corpus]

tokenized_docs = [
    doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
    for doc in textacy_docs]

vectorizer = textacy.Vectorizer(apply_idf=False, norm=None)
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)

print("\nVocabulary: ", vectorizer.vocabulary_terms)
print("standard idf: ")
print(textacy.vsm.matrix_utils.get_inverse_doc_freqs(doc_term_matrix, type_='standard'))
Output:
Vocabulary: {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
standard idf:
[1. 1.405 2.098 2.098 1. 2.098 1.]
Term Frequency-Inverse Document Frequency (tf-idf): tf-idf is a measure of how important a word is to a document in a corpus. The tf of a word weighted by its idf gives us the tf-idf measure of the word.
tf-idf(word, document, corpus) = tf(word, document) * idf(word, corpus)
tf-idf('apple', 'd1', corpus) = tf('apple', 'd1') * idf('apple', corpus) = 1 * 1.405 = 1.405
tf-idf('cat', 'd3', corpus) = tf('cat', 'd3') * idf('cat', corpus) = 1 * 2.098 = 2.098
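Equivalently, the full tf-idf matrix is just the element-wise product of the raw tf matrix and the idf vector. A minimal numpy check, with the values copied from the outputs above:

import numpy as np

tf_matrix = np.array([[1, 1, 0, 0, 1, 1, 1],
                      [1, 1, 0, 1, 1, 0, 1],
                      [1, 0, 1, 0, 1, 0, 1]], dtype=float)   # raw tf (no normalization)
idf = np.array([1.0, 1.405, 2.098, 2.098, 1.0, 2.098, 1.0])  # standard idf

print(tf_matrix * idf)  # matches the tf-idf matrix produced by the Vectorizer below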
Code : tf-idf
textacy_docs = [textacy.Doc(doc) for doc in corpus]

tokenized_docs = [
    doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
    for doc in textacy_docs]

vectorizer = textacy.Vectorizer(apply_idf=True, norm=None, idf_type='standard')
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)

print("\nVocabulary: ", vectorizer.vocabulary_terms)
print("tf-idf: ")
print(doc_term_matrix.toarray())
Output:
Vocabulary: {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
tf-idf:
[[1. 1.405 0. 0. 1. 2.098 1. ]
[1. 1.405 0. 2.098 1. 0. 1. ]
[1. 0. 2.098 0. 1. 0. 1. ]]
(1) How can I get the TF-IDF for this word against the corpus, rather than each character?
As seen above, there is no tf-idf of a word defined independently of a document; the tf-idf of a word is always with respect to a document in a corpus.
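What you can do is look up the column for a given word and read its tf-idf score in each document. A small sketch on top of the fitted tf-idf Vectorizer above; a word that never occurs in the corpus (such as 'acculer') simply has no column:

word = 'apple'
if word in vectorizer.vocabulary_terms:
    col = vectorizer.vocabulary_terms[word]
    # tf-idf of the word in every document of the corpus
    print(doc_term_matrix.toarray()[:, col])   # [1.405 1.405 0.   ]
else:
    print("'{0}' does not occur in the corpus".format(word))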
(2) How can I provide my own corpus and point to it as a param?
This is shown in the samples above; the relevant parameter combinations are listed below, and a sketch of fitting a Vectorizer on your own documents follows the list.
tf (raw counts): apply_idf=False, norm=None
tf (l1 normalized): apply_idf=False, norm='l1'
tf (l2 normalized): apply_idf=False, norm='l2'
tf-idf (standard): apply_idf=True, idf_type='standard'
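Concretely, the corpus is simply whatever iterable of tokenized documents you pass to fit_transform. A minimal sketch with a hypothetical French document list and plain whitespace tokenization (in practice you would tokenize with a French spaCy model rather than split()):

# Hypothetical French corpus; any list of your own documents works the same way
my_corpus = ["le chat est sur la table", "le chien dort sous la table"]

# Naive whitespace tokenization, just for illustration
tokenized_docs = [doc.lower().split() for doc in my_corpus]

vectorizer = textacy.Vectorizer(apply_idf=True, norm=None, idf_type='standard')
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
print(vectorizer.vocabulary_terms)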
(3) Can TF-IDF be used at a sentence level? ie: what is the relative frequency of this sentence's terms against the corpus.
Yes you can, if and only if you treat each sentence as a separate document. In that case the tf-idf vector (full row) of the corresponding document can be treated as a vector representation of that document (which is a single sentence in your case).
In the case of our corpus (which in fact contains a single sentence per document), the vector representations of d1 and d2 should be close compared to those of d1 and d3. Let's check the cosine similarity and see:
from sklearn.metrics.pairwise import cosine_similarity   # assuming scikit-learn's pairwise cosine similarity

cosine_similarity(doc_term_matrix)
Output
array([[1. , 0.53044716, 0.35999211],
[0.53044716, 1. , 0.35999211],
[0.35999211, 0.35999211, 1. ]])
As you can see, cosine_similarity(d1,d2) = 0.53 and cosine_similarity(d1,d3) = 0.36, so indeed d1 and d2 are more similar than d1 and d3 (1 being exactly similar and 0 being not similar, i.e. orthogonal vectors).
Once you train your Vectorizer you can pickle the trained object to disk for later use.
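For example, a minimal pickling sketch (the file name is arbitrary):

import pickle

# Save the fitted vectorizer for later reuse
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

# ...later, load it back and transform new tokenized documents
with open("vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)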
tf of a word is at the document level, idf of a word is at the corpus level, and tf-idf of a word is at the document level with respect to the corpus. They are well suited for vector representations of a document (or of a sentence, when a document is made up of a single sentence). If you are interested in vector representations of words, then explore word embeddings such as word2vec, fastText and GloVe.