
TfidfVectorizer - Normalisation bias

I want to make sure I understand what the attributes use_idf and sublinear_tf do in the TfidfVectorizer object. I've been researching this for a few days. I am trying to classify documents of varied length and currently use tf-idf for feature selection.

I believe that when use_idf=True the algorithm corrects for the inherent issue with raw term frequency, namely that a term that is X times more frequent shouldn't be X times as important.

It does this by using the tf*idf formula. Then sublinear_tf=True replaces tf with 1 + log(tf), which I understand as reducing the bias of lengthy documents versus short documents.

I am dealing with an inherent bias towards lengthy documents (most of them belong to one class). Does this normalisation really diminish that bias?

How can I make sure that the length of the documents in the corpus is not integrated into the model?

I'm trying to verify that the normalisation is actually being applied in the model. I tried to extract the normalised vectors of the corpus, and I assumed I could just sum up each row of the TfidfVectorizer matrix. However, the sums are greater than 1, and I thought a normalised corpus would map every document into the range 0-1.

from sklearn.feature_extraction.text import TfidfVectorizer

# stopwords, tokenizer and X_train are defined elsewhere
vect = TfidfVectorizer(max_features=20000, strip_accents='unicode',
                       stop_words=stopwords, analyzer='word', use_idf=True,
                       tokenizer=tokenizer, ngram_range=(1, 2),
                       sublinear_tf=True, norm='l2')

tfidf = vect.fit_transform(X_train)
# sum each row of the l2-normalised matrix
vect_sum = tfidf.sum(axis=1)
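
For reference, a minimal way to check what norm='l2' actually guarantees is to compare the plain row sums with the Euclidean (l2) length of each row; this is a sketch assuming the tfidf matrix produced above:

import numpy as np

# plain row sums (what the snippet above computes)
row_sums = np.asarray(tfidf.sum(axis=1)).ravel()
# Euclidean (l2) length of each row: square root of the sum of squared entries
row_l2 = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1))).ravel()
print(row_sums[:5])  # can be greater than 1
print(row_l2[:5])    # should be ~1.0 for every document when norm='l2'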
asked Dec 23 '15 by OAK


2 Answers

use_idf=True (the default) introduces a global component on top of the term frequency component (the local component, i.e. the individual article). When looking at the similarity of two texts, instead of merely counting the number of terms that each of them contains and comparing them, introducing the idf helps to categorise those terms as relevant or not. According to Zipf's law, the "frequency of any word is inversely proportional to its rank". That is, the most common word will appear roughly twice as often as the second most common word, three times as often as the third most common word, and so on. Even after removing stop words, all words are subject to Zipf's law.

In this sense, imagine you have 5 articles describing the topic of automobiles. In this example the word "auto" will likely appear in all 5 texts and therefore will not be a unique identifier of a single text. On the other hand, if one article describes auto "insurance" while another describes auto "mechanics", these two words ("mechanics" and "insurance") will be unique identifiers of their respective texts. By using the idf, words that are less common across texts ("mechanics" and "insurance", for example) receive a higher weight. Therefore, using the idf does not tackle the bias generated by the length of an article, since it is, again, a measure of a global component. If you want to reduce the bias generated by length then, as you said, using sublinear_tf=True is a way to address it, since you are transforming the local component (each article).
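
To make this concrete, here is a small sketch on a made-up three-document corpus (the documents and variable names are illustrative, not from the question): with use_idf=True, "auto", which occurs in every document, is down-weighted relative to "insurance" and "mechanics", and sublinear_tf=True damps the repeated occurrences of "auto" within a single document.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["auto insurance premiums",
        "auto mechanics tools",
        "auto auto auto auto dealer"]

for sublinear in (False, True):
    # norm=None so the raw tf-idf weights are visible rather than the normalised ones
    vect = TfidfVectorizer(use_idf=True, sublinear_tf=sublinear, norm=None)
    X = vect.fit_transform(docs).toarray()
    terms = vect.get_feature_names_out()  # use get_feature_names() on older scikit-learn
    print("sublinear_tf =", sublinear)
    print(dict(zip(terms, X[2].round(2))))  # weights for the repetitive third document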

Hope it helps.

answered Nov 08 '22 by Economist_Ayahuasca


Neither use_idf nor sublinear_tf deals with document length. And actually your explanation for use_idf, "a term that is X times more frequent shouldn't be X times as important", is a better description of sublinear_tf, since sublinear_tf makes the tf-idf score grow logarithmically with the term frequency rather than linearly.
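
A quick sketch of that sub-linear scaling; the values are just the 1 + log(tf) term that replaces the raw count (scikit-learn uses the natural logarithm here):

import math

for tf in (1, 2, 5, 10, 100):
    print(tf, round(1 + math.log(tf), 2))
# 1 -> 1.0, 2 -> 1.69, 5 -> 2.61, 10 -> 3.3, 100 -> 5.61:
# a term occurring 100x as often gets ~5.6x the weight, not 100x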

use_idf means to use Inverse Document Frequency, so that terms that appear so frequently that they show up in most documents (i.e., a bad indicator) are weighted less than terms that appear less frequently and only in specific documents (i.e., a good indicator).
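
One way to see this is to inspect the fitted vectorizer's idf_ attribute; a minimal sketch on a made-up corpus (the documents are purely illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the engine stalled", "the engine overheated", "the brakes failed"]
vect = TfidfVectorizer(use_idf=True)
vect.fit(docs)

# 'the' appears in every document, so it gets the lowest idf;
# rarer terms get a larger idf and therefore a larger weight
print(dict(zip(vect.get_feature_names_out(), vect.idf_.round(3))))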

To reduce document length bias you use normalization (the norm parameter of TfidfVectorizer): each term's tf-idf score is scaled in proportion to the document's total, dividing by the sum of absolute values for norm='l1' and by the Euclidean length (the square root of the sum of squares) for norm='l2'.
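
A small sketch of what each norm guarantees (toy documents, purely illustrative). This is also why the row sums in the question exceed 1: with norm='l2' it is the sum of squares of each row that equals 1, not the plain sum.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["short auto text",
        "a much longer text about auto insurance and auto mechanics and auto dealers"]

for norm in ('l1', 'l2'):
    X = TfidfVectorizer(norm=norm).fit_transform(docs).toarray()
    print(norm,
          "abs row sums:", np.abs(X).sum(axis=1).round(3),
          "squared row sums:", (X ** 2).sum(axis=1).round(3))
# l1: the absolute values of each row sum to 1
# l2: the squares of each row sum to 1 (plain sums can therefore exceed 1)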

By default, TfidfVectorizer already uses norm='l2', though, so I'm not sure what is causing the problem you are facing. Perhaps those longer documents really do contain similar words? Also, classification often depends a lot on the data, so I can't say much more here to solve your problem.

References:

  • TfidfVectorizer documentation
  • Wikipedia
  • Stanford Book
answered Nov 08 '22 by justhalf