Calculate TF-IDF for a single word in Textacy

I'm trying to use Textacy to calculate the TF-IDF score for a single word across the standard corpus, but am a bit unclear about the result I am receiving.

I was expecting a single float representing the frequency of the word in the corpus. So why am I receiving a list (?) of 7 results?

"acculer" is actually a French word, so I was expecting a result of 0 from an English corpus.

word = 'acculer'
vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=True, idf_type='smooth')
tf_idf = vectorizer.fit_transform(word)
logger.info("tf_idf:")
logger.info(tf_idf)

Output

tf_idf:
(0, 0)  2.386294361119891
(1, 1)  1.9808292530117262
(2, 1)  1.9808292530117262
(3, 5)  2.386294361119891
(4, 3)  2.386294361119891
(5, 2)  2.386294361119891
(6, 4)  2.386294361119891

The second part of the question is: how can I provide my own corpus to the TF-IDF function in Textacy, especially one in a different language?

EDIT

As mentioned by @Vishal, I have logged the output using this line:

logger.info(vectorizer.vocabulary_terms)

It seems the provided word acculer has been split into characters.

{'a': 0, 'c': 1, 'u': 5, 'l': 3, 'e': 2, 'r': 4}

(1) How can I get the TF-IDF for this word against the corpus, rather than each character?

(2) How can I provide my own corpus and point to it as a param?

(3) Can TF-IDF be used at a sentence level? ie: what is the relative frequency of this sentence's terms against the corpus.

asked Apr 19 '19 by port5432

1 Answer

Fundamentals

Let's get the definitions clear before looking into the actual questions.

Assume our corpus contains 3 documents (d1, d2 and d3 respectively):

corpus = ["this is a red apple", "this is a green apple", "this is a cat"]

Term Frequency (tf)

tf (of a word) is defined as the number of times the word appears in a document.

tf(word, document) = count(word, document) # Number of times word appears in the document

tf is defined for a word at the document level.

tf('a',d1)     = 1      tf('a',d2)     = 1      tf('a',d3)     = 1
tf('apple',d1) = 1      tf('apple',d2) = 1      tf('apple',d3) = 0
tf('cat',d1)   = 0      tf('cat',d2)   = 0      tf('cat',d3)   = 1
tf('green',d1) = 0      tf('green',d2) = 1      tf('green',d3) = 0
tf('is',d1)    = 1      tf('is',d2)    = 1      tf('is',d3)    = 1
tf('red',d1)   = 1      tf('red',d2)   = 0      tf('red',d3)   = 0
tf('this',d1)  = 1      tf('this',d2)  = 1      tf('this',d3)  = 1

Using the raw counts has a problem: the tf values of words in longer documents are high compared to those in shorter documents. This can be solved by normalizing the raw counts by the document length (the number of words in the corresponding document). This is called l1 normalization. The document d1 can now be represented by a tf vector containing the tf values of all the words in the vocabulary of the corpus. There is another kind of normalization, called l2, which makes the l2 norm of the document's tf vector equal to 1.

tf(word, document, normalize='l1') = count(word, document)/|document|
tf(word, document, normalize='l2') = count(word, document)/l2_norm(document)
|d1| = 5, |d2| = 5, |d3| = 4
l2_norm(d1) = 0.447, l2_norm(d2) = 0.447, l2_norm(d3) = 0.5
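
These document lengths and l2 norms are easy to sanity-check with plain Python (standard library only, no textacy involved); a minimal sketch:

# Sanity check of the document lengths (for l1) and l2 norms listed above.
from collections import Counter
import math

corpus = ["this is a red apple", "this is a green apple", "this is a cat"]

for doc in corpus:
    counts = Counter(doc.split())                          # raw tf per word
    length = sum(counts.values())                          # |document|, used for l1 normalization
    l2 = math.sqrt(sum(c * c for c in counts.values()))    # l2 norm of the document's tf vector
    print(length, round(1 / l2, 3))                        # prints: 5 0.447, 5 0.447, 4 0.5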

Code : tf

import textacy

corpus = ["this is a red apple", "this is a green apple", "this is a cat"]
# Convert docs to textacy format
textacy_docs = [textacy.Doc(doc) for doc in corpus]

for norm in [None, 'l1', 'l2']:
    # tokenize the documents
    tokenized_docs = [
        doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
        for doc in textacy_docs]

    # Fit the tf matrix
    vectorizer = textacy.Vectorizer(apply_idf=False, norm=norm)
    doc_term_matrix = vectorizer.fit_transform(tokenized_docs)

    print("\nVocabulary: ", vectorizer.vocabulary_terms)
    print("TF with {0} normalize".format(norm))
    print(doc_term_matrix.toarray())

Output:

Vocabulary:  {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
TF with None normalize
[[1 1 0 0 1 1 1]
 [1 1 0 1 1 0 1]
 [1 0 1 0 1 0 1]]

Vocabulary:  {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
TF with l1 normalize
[[0.2  0.2  0.   0.   0.2  0.2  0.2 ]
 [0.2  0.2  0.   0.2  0.2  0.   0.2 ]
 [0.25 0.   0.25 0.   0.25 0.   0.25]]

Vocabulary:  {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
TF with l2 normalize
[[0.4472136 0.4472136 0.        0.        0.4472136 0.4472136 0.4472136]
 [0.4472136 0.4472136 0.        0.4472136 0.4472136 0.        0.4472136]
 [0.5       0.        0.5       0.        0.5       0.        0.5      ]]

The rows of the tf matrix correspond to documents (hence 3 rows for our corpus) and the columns correspond to the words in the vocabulary (the index of each word is shown in the vocabulary dictionary).
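
To map a column index back to its word (for example, when inspecting a single column of the term matrix), you can invert the vocabulary dictionary. A minimal sketch, assuming the vectorizer fitted in the code above:

# Invert the vocabulary mapping so we can go from a column index back to the word.
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_terms.items()}
print(index_to_word[1])   # 'apple' for the vocabulary shown above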

Inverse Document Frequency (idf)

Some words convey less information than others. For example, words like the, a, an, this, that are very common and convey very little information. idf is a measure of the importance of a word: a word appearing in many documents is considered less informative than a word appearing in only a few documents.

idf(word, corpus) = log(|corpus| / number of documents containing word) + 1  # standard idf

For our corpus, intuitively, idf('apple', corpus) < idf('cat', corpus):

idf('apple', corpus) = log(3/2) + 1 = 1.405 
idf('cat', corpus) = log(3/1) + 1 = 2.098
idf('this', corpus) = log(3/3) + 1 = 1.0
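
These values are easy to verify by hand with the standard library; the document frequencies below are read off the tf table above:

# Quick check of the standard idf values above (3 documents in the corpus).
import math

n_docs = 3
doc_freq = {'apple': 2, 'cat': 1, 'this': 3}   # number of documents containing each word

for word, df in doc_freq.items():
    print(word, math.log(n_docs / df) + 1)     # apple 1.4055, cat 2.0986, this 1.0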

Code : idf

textacy_docs = [textacy.Doc(doc) for doc in corpus]    
tokenized_docs = [
    doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
    for doc in textacy_docs]

vectorizer = textacy.Vectorizer(apply_idf=False, norm=None)
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)

print("\nVocabulary: ", vectorizer.vocabulary_terms)
print("standard idf: ")
print(textacy.vsm.matrix_utils.get_inverse_doc_freqs(doc_term_matrix, type_='standard'))

Output:

Vocabulary:  {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
standard idf: 
[1.     1.405       2.098       2.098       1.      2.098       1.]

Term Frequency–Inverse Document Frequency (tf-idf): tf-idf is a measure of how important a word is to a document in a corpus. The tf of a word weighted by its idf gives us the tf-idf measure of the word.

tf-idf(word, document, corpus) = tf(word, document) * idf(word, corpus)
tf-idf('apple', 'd1', corpus) = tf('apple', 'd1') * idf('apple', corpus) = 1 * 1.405 = 1.405
tf-idf('cat', 'd3', corpus) = tf('cat', 'd3') * idf('cat', corpus) = 1 * 2.098 = 2.098
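
To see the mechanics end to end: the full tf-idf matrix is just the raw tf matrix scaled column-wise by the idf vector. A small numpy sketch using the tf (None normalization) and idf outputs shown above:

# tf-idf = raw tf matrix scaled column-wise by the idf vector.
import numpy as np

tf = np.array([[1, 1, 0, 0, 1, 1, 1],
               [1, 1, 0, 1, 1, 0, 1],
               [1, 0, 1, 0, 1, 0, 1]])
idf = np.array([1.0, 1.405, 2.098, 2.098, 1.0, 2.098, 1.0])

print(tf * idf)   # reproduces the tf-idf matrix in the output block below (up to rounding)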

Code : tf-idf

textacy_docs = [textacy.Doc(doc) for doc in corpus]

tokenized_docs = [
    doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
    for doc in textacy_docs]

vectorizer = textacy.Vectorizer(apply_idf=True, norm=None, idf_type='standard')
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)

print("\nVocabulary: ", vectorizer.vocabulary_terms)
print("tf-idf: ")
print(doc_term_matrix.toarray())

Output:

Vocabulary:  {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
tf-idf: 
[[1.         1.405   0.         0.         1.         2.098   1.        ]
 [1.         1.405   0.         2.098      1.         0.      1.        ]
 [1.         0.      2.098      0.         1.         0.      1.        ]]

Now coming to the questions:

(1) How can I get the TF-IDF for this word against the corpus, rather than each character?

As seen above, tf-idf is not defined for a word in isolation; the tf-idf of a word is always with respect to a document in a corpus.
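
If what you are after is the tf-idf value of one particular word in one particular document, you can look it up in the fitted matrix through the vocabulary mapping. A minimal sketch, assuming the vectorizer and doc_term_matrix from the tf-idf code above (the word and document index are just examples):

# Look up the tf-idf of a single word in a single document from the fitted matrix.
word = 'apple'
doc_index = 0                                # d1
col = vectorizer.vocabulary_terms[word]      # column index of the word
print(doc_term_matrix[doc_index, col])       # ≈ 1.405 for 'apple' in d1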

(2) How can I provide my own corpus and point to it as a param?

This is shown in the samples above (a sketch for a corpus in another language follows the list below).

  1. Convert the text documents into textacy Docs using the textacy.Doc API.
  2. Tokenize the textacy.Docs using the to_terms_list method. (With this method you can add unigrams, bigrams or trigrams to the vocabulary, filter out stop words, normalize text, etc.)
  3. Use textacy.Vectorizer to create the term matrix from the tokenized documents. The term matrix returned is:
    • tf (raw counts): apply_idf=False, norm=None
    • tf (l1 normalized): apply_idf=False, norm='l1'
    • tf (l2 normalized): apply_idf=False, norm='l2'
    • tf-idf (standard): apply_idf=True, idf_type='standard'
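
For a corpus in another language the pipeline is the same; only the language model behind the textacy Docs changes. A hypothetical sketch for a French corpus, assuming the textacy version used above (which exposes textacy.Doc and accepts a lang argument) and an installed French spaCy model such as fr_core_news_sm; newer textacy releases replace textacy.Doc with textacy.make_spacy_doc, so adapt the first step accordingly:

# Hypothetical example: same pipeline, French corpus, French spaCy model (assumed installed).
import textacy

french_corpus = ["le chat mange la pomme", "le chien mange la pomme rouge"]

textacy_docs = [textacy.Doc(doc, lang='fr_core_news_sm') for doc in french_corpus]
tokenized_docs = [
    doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
    for doc in textacy_docs]

vectorizer = textacy.Vectorizer(apply_idf=True, norm=None, idf_type='standard')
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
print(vectorizer.vocabulary_terms)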

(3) Can TF-IDF be used at a sentence level? ie: what is the relative frequency of this sentence's terms against the corpus.

Yes you can, provided you treat each sentence as a separate document. In that case the tf-idf vector (the full row) of the corresponding document can be treated as a vector representation of that document (which is a single sentence in your case).

In the case of our corpus (which in fact contains a single sentence per document), the vector representations of d1 and d2 should be closer than those of d1 and d3. Let's check the cosine similarity and see:

# cosine_similarity here is presumably scikit-learn's, which works directly on the sparse matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(doc_term_matrix)

Output

array([[1.        , 0.53044716, 0.35999211],
       [0.53044716, 1.        , 0.35999211],
       [0.35999211, 0.35999211, 1.        ]])

As you can see, cosine_similarity(d1, d2) = 0.53 and cosine_similarity(d1, d3) = 0.36, so indeed d1 and d2 are more similar than d1 and d3 (1 meaning identical and 0 meaning not similar at all, i.e. orthogonal vectors).

Once you have trained your Vectorizer you can pickle the trained object to disk for later use.
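
For example, a minimal sketch using the standard library's pickle module (the file name is arbitrary):

import pickle

# Persist the fitted Vectorizer so the learned vocabulary and idf weights can be reused later.
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)

# Later: reload it and transform new tokenized documents against the same vocabulary.
with open('vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)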

Conclusion

tf of a word is defined at the document level, idf of a word at the corpus level, and tf-idf of a word at the document level with respect to the corpus. They are well suited for vector representations of a document (or of a sentence, when a document is made up of a single sentence). If you are interested in vector representations of words, then explore word embeddings (word2vec, fastText, GloVe, etc.).

answered Sep 22 '22 by mujjiga