
Understanding min_df and max_df in scikit CountVectorizer

I have five text files that I input to a CountVectorizer. When specifying min_df and max_df for the CountVectorizer instance, what do the min/max document frequencies mean exactly? Is it the frequency of a word within its particular text file, or is it the frequency of the word across the entire corpus (the five text files)?

What are the differences when min_df and max_df are provided as integers or as floats?

The documentation doesn't seem to provide a thorough explanation, nor does it supply an example demonstrating the use of these two parameters. Could someone provide an explanation or example demonstrating min_df and max_df?

asked Dec 29 '14 by moeabdol


2 Answers

max_df is used for removing terms that appear too frequently, also known as "corpus-specific stop words". For example:

  • max_df = 0.50 means "ignore terms that appear in more than 50% of the documents".
  • max_df = 25 means "ignore terms that appear in more than 25 documents".

The default max_df is 1.0, which means "ignore terms that appear in more than 100% of the documents". Thus, the default setting does not ignore any terms.
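A minimal sketch of that behaviour (the four toy documents below are invented for illustration): with max_df = 0.50, a term that occurs in more than half of the documents is dropped from the vocabulary and ends up in the fitted vectorizer's stop_words_ attribute.

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "apple banana cherry",
        "apple banana",
        "apple cherry",
        "apple durian",
    ]

    # "apple" occurs in 4 of 4 documents (100% > 50%), so it is dropped;
    # "banana" and "cherry" occur in exactly 50% of the documents and stay.
    vectorizer = CountVectorizer(max_df=0.50)
    vectorizer.fit(docs)
    print(sorted(vectorizer.vocabulary_))  # ['banana', 'cherry', 'durian']
    print(vectorizer.stop_words_)          # {'apple'}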


min_df is used for removing terms that appear too infrequently. For example:

  • min_df = 0.01 means "ignore terms that appear in less than 1% of the documents".
  • min_df = 5 means "ignore terms that appear in fewer than 5 documents".

The default min_df is 1, which means "ignore terms that appear in fewer than 1 document". Thus, the default setting does not ignore any terms.
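The mirror image for min_df, again on invented toy documents: with min_df = 2, a term has to occur in at least 2 documents to make it into the vocabulary.

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "apple banana",
        "apple banana",
        "apple cherry",
    ]

    # "cherry" occurs in only 1 document, below the min_df=2 threshold.
    vectorizer = CountVectorizer(min_df=2)
    vectorizer.fit(docs)
    print(sorted(vectorizer.vocabulary_))  # ['apple', 'banana']
    print(vectorizer.stop_words_)          # {'cherry'}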

answered by Kevin Markham

As per the CountVectorizer documentation:

When min_df or max_df is given as a float in the range [0.0, 1.0], it refers to the document frequency as a proportion, i.e. the percentage of documents that contain the term.

When given as an int, it refers to the absolute number of documents that contain the term.

Consider an example where you have 5 text files (or documents). If you set max_df = 0.6, that translates to a cutoff of 0.6 * 5 = 3 documents. If you set max_df = 2, that simply translates to a cutoff of 2 documents.
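A short sketch of that arithmetic (the five documents below are made up for the example): the float threshold is scaled by the number of documents, while the int threshold is used as-is.

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "the cat sat",
        "the cat ran",
        "the cat dog",
        "the dog",
        "the bird",
    ]
    # document frequencies: the=5, cat=3, dog=2, bird=1, ran=1, sat=1

    # max_df=0.6 -> cutoff of 0.6 * 5 = 3 documents; only "the" exceeds it.
    print(sorted(CountVectorizer(max_df=0.6).fit(docs).vocabulary_))
    # ['bird', 'cat', 'dog', 'ran', 'sat']

    # max_df=2 -> cutoff of 2 documents; now "cat" (3 documents) is dropped too.
    print(sorted(CountVectorizer(max_df=2).fit(docs).vocabulary_))
    # ['bird', 'dog', 'ran', 'sat']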

The source code snippet below is copied from the scikit-learn source on GitHub and shows how max_doc_count is constructed from max_df. The code for min_df is similar and can be found in the same source file.

    max_doc_count = (max_df
                     if isinstance(max_df, numbers.Integral)
                     else max_df * n_doc)
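The companion line for min_df presumably follows the same pattern (a sketch based on the snippet above; check the linked source for the exact code):

    min_doc_count = (min_df
                     if isinstance(min_df, numbers.Integral)
                     else min_df * n_doc)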

The defaults for min_df and max_df are 1 and 1.0, respectively. These defaults don't filter anything: min_df = 1 only requires a term to be found in at least 1 document, and max_df = 1.0 allows a term to appear in up to 100% of the documents, so no term is excluded by the default settings.
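A quick sketch of that (the three toy documents are invented here): with the default settings nothing is removed, whether a term appears in every document or in only one.

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["common word one", "common word two", "common three"]

    # "common" occurs in all 3 documents and "one"/"two"/"three" in 1 each,
    # yet the default min_df=1 / max_df=1.0 keeps every term.
    print(sorted(CountVectorizer().fit(docs).vocabulary_))
    # ['common', 'one', 'three', 'two', 'word']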

max_df and min_df are both used internally to calculate max_doc_count and min_doc_count, the maximum and minimum number of documents that a term must be found in. These are then passed to self._limit_features as the keyword arguments high and low respectively. The docstring for self._limit_features is:

"""Remove too rare or too common features.  Prune features that are non zero in more samples than high or less documents than low, modifying the vocabulary, and restricting it to at most the limit most frequent.  This does not prune samples with zero features. """ 
answered by Ffisegydd