
Understanding min_df and max_df in scikit CountVectorizer

I have five text files that I input to a CountVectorizer. When specifying min_df and max_df for the CountVectorizer instance, what do the min/max document frequencies mean exactly? Is it the frequency of a word within its particular text file, or is it the frequency of the word across the entire corpus (the five text files)?

What are the differences when min_df and max_df are provided as integers or as floats?

The documentation doesn't seem to provide a thorough explanation, nor does it supply an example demonstrating the use of these two parameters. Could someone provide an explanation or example demonstrating min_df and max_df?

asked Dec 29 '14 by moeabdol


2 Answers

max_df is used for removing terms that appear too frequently, also known as "corpus-specific stop words". For example:

  • max_df = 0.50 means "ignore terms that appear in more than 50% of the documents".
  • max_df = 25 means "ignore terms that appear in more than 25 documents".

The default max_df is 1.0, which means "ignore terms that appear in more than 100% of the documents". Thus, the default setting does not ignore any terms.
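A minimal sketch of that behaviour (the four toy documents below are invented for illustration): with max_df = 0.50, a term that occurs in more than half of the documents is dropped from the vocabulary and ends up in the fitted vectorizer's stop_words_ attribute.

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "apple banana cherry",
        "apple banana",
        "apple cherry",
        "apple durian",
    ]

    # "apple" occurs in 4 of 4 documents (100% > 50%), so it is dropped;
    # "banana" and "cherry" occur in exactly 50% of the documents and stay.
    vectorizer = CountVectorizer(max_df=0.50)
    vectorizer.fit(docs)
    print(sorted(vectorizer.vocabulary_))  # ['banana', 'cherry', 'durian']
    print(vectorizer.stop_words_)          # {'apple'}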


min_df is used for removing terms that appear too infrequently. For example:

  • min_df = 0.01 means "ignore terms that appear in less than 1% of the documents".
  • min_df = 5 means "ignore terms that appear in fewer than 5 documents".

The default min_df is 1, which means "ignore terms that appear in fewer than 1 document". Thus, the default setting does not ignore any terms.
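The mirror image for min_df, again on invented toy documents: with min_df = 2, a term has to occur in at least 2 documents to make it into the vocabulary.

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "apple banana",
        "apple banana",
        "apple cherry",
    ]

    # "cherry" occurs in only 1 document, below the min_df=2 threshold.
    vectorizer = CountVectorizer(min_df=2)
    vectorizer.fit(docs)
    print(sorted(vectorizer.vocabulary_))  # ['apple', 'banana']
    print(vectorizer.stop_words_)          # {'cherry'}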

answered by Kevin Markham

As per the CountVectorizer documentation:

When min_df or max_df is given as a float in the range [0.0, 1.0], it refers to the document frequency as a proportion, i.e. the percentage of documents that contain the term.

When given as an int, it refers to the absolute number of documents that contain the term.

Consider an example where you have 5 text files (or documents). If you set max_df = 0.6, that translates to a cutoff of 0.6 * 5 = 3 documents. If you set max_df = 2, that simply translates to a cutoff of 2 documents.
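A short sketch of that arithmetic (the five documents below are made up for the example): the float threshold is scaled by the number of documents, while the int threshold is used as-is.

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "the cat sat",
        "the cat ran",
        "the cat dog",
        "the dog",
        "the bird",
    ]
    # document frequencies: the=5, cat=3, dog=2, bird=1, ran=1, sat=1

    # max_df=0.6 -> cutoff of 0.6 * 5 = 3 documents; only "the" exceeds it.
    print(sorted(CountVectorizer(max_df=0.6).fit(docs).vocabulary_))
    # ['bird', 'cat', 'dog', 'ran', 'sat']

    # max_df=2 -> cutoff of 2 documents; now "cat" (3 documents) is dropped too.
    print(sorted(CountVectorizer(max_df=2).fit(docs).vocabulary_))
    # ['bird', 'dog', 'ran', 'sat']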

The source code snippet below is copied from the scikit-learn source on GitHub and shows how max_doc_count is constructed from max_df. The code for min_df is similar and can be found in the same source file.

    max_doc_count = (max_df
                     if isinstance(max_df, numbers.Integral)
                     else max_df * n_doc)
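The companion line for min_df presumably follows the same pattern (a sketch based on the snippet above; check the linked source for the exact code):

    min_doc_count = (min_df
                     if isinstance(min_df, numbers.Integral)
                     else min_df * n_doc)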

The defaults for min_df and max_df are 1 and 1.0, respectively. These defaults don't filter anything: min_df = 1 only requires a term to be found in at least 1 document, and max_df = 1.0 allows a term to appear in up to 100% of the documents, so no term is excluded by the default settings.
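A quick sketch of that (the three toy documents are invented here): with the default settings nothing is removed, whether a term appears in every document or in only one.

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["common word one", "common word two", "common three"]

    # "common" occurs in all 3 documents and "one"/"two"/"three" in 1 each,
    # yet the default min_df=1 / max_df=1.0 keeps every term.
    print(sorted(CountVectorizer().fit(docs).vocabulary_))
    # ['common', 'one', 'three', 'two', 'word']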

max_df and min_df are both used internally to calculate max_doc_count and min_doc_count, the maximum and minimum number of documents that a term must be found in. These are then passed to self._limit_features as the keyword arguments high and low respectively. The docstring for self._limit_features is:

"""Remove too rare or too common features.  Prune features that are non zero in more samples than high or less documents than low, modifying the vocabulary, and restricting it to at most the limit most frequent.  This does not prune samples with zero features. """ 
answered by Ffisegydd