
ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df

from sklearn.feature_extraction.text import TfidfVectorizer

# tokenize_and_stem_body and totalvocab_stemmed_body are defined elsewhere (not shown).
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, max_features=200000,
                                   min_df=0.5, stop_words='english',
                                   use_idf=True, sublinear_tf=True,
                                   tokenizer=tokenize_and_stem_body,
                                   ngram_range=(1, 3))
tfidf_matrix_body = tfidf_vectorizer.fit_transform(totalvocab_stemmed_body)

The above code gives me this error:

ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df.

Can anyone help me with this? I have changed all the values from 80 to 100, but the issue remains the same.

asked Jun 14 '16 15:06 by Jeet Dadhich


1 Answer

From the scikit-learn documentation for TfidfVectorizer:

max_df : float in range [0.0, 1.0] or int, default=1.0

When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

min_df : float in range [0.0, 1.0] or int, default=1

When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

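Please check the data type of the variable totalvocab_stemmed_body. If it is a list, each element of the list is treated as a separate document. So if it is a flat list of stemmed words rather than a list of texts, every single word counts as its own document.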
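To illustrate the point about list elements, here is a minimal sketch with made-up data (the token and document lists are hypothetical, not the asker's). It assumes the default tokenizer instead of tokenize_and_stem_body. A flat list of tokens is treated as six one-word documents, so no term reaches min_df=0.5, while the same kind of tokens grouped into whole documents fits without error.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical data: a flat list of stemmed tokens. Each string is one "document".
tokens_as_documents = ["run", "jump", "walk", "swim", "run", "climb"]

try:
    TfidfVectorizer(min_df=0.5, max_df=0.95).fit_transform(tokens_as_documents)
    error_message = None
except ValueError as exc:
    # No token occurs in at least half of the 6 one-word documents.
    error_message = str(exc)
print(error_message)

# Similar tokens grouped into real documents (one string per document).
documents = ["run jump walk", "run swim", "run climb walk", "jump"]
vectorizer = TfidfVectorizer(min_df=0.5, max_df=0.95)
matrix = vectorizer.fit_transform(documents)
print(sorted(vectorizer.vocabulary_))
```

If the real input turns out to be a token list, passing the original texts (one string per document) to fit_transform is likely what was intended.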

Case 1: Number of documents = 2,000,000, min_df=0.5.

If you have a large number of documents (say 2 million), each containing only a few words and drawn from very different domains, there is very little chance that any term appears in at least 1,000,000 (2,000,000 * 0.5) documents.
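A scaled-down sketch of Case 1, using six invented one-line documents from unrelated domains instead of 2 million. It counts each term's document frequency with CountVectorizer and compares it against the threshold that a float min_df implies (min_df * number of documents).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: short documents with no vocabulary in common.
docs = [
    "kernel scheduler latency",
    "sourdough starter hydration",
    "violin bow rosin",
    "glacier moraine erosion",
    "mortgage escrow interest",
    "penalty kick offside",
]
min_df = 0.5

# binary=True records presence/absence, so column sums are document frequencies.
X = CountVectorizer(binary=True).fit_transform(docs)
doc_freq = np.asarray(X.sum(axis=0)).ravel()

# A float min_df is a proportion of documents: a term must reach this count.
min_doc_count = min_df * len(docs)
surviving_terms = int((doc_freq >= min_doc_count).sum())

print("highest document frequency:", doc_freq.max())
print("required document frequency:", min_doc_count)
print("terms that survive min_df:", surviving_terms)
```

In a situation like this, an integer min_df (an absolute document count, such as 1 or 2) or a much smaller proportion would keep some terms.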

Case 2: Number of documents = 200, max_df=0.95.

If you have a set of near-identical documents (say 200), the same terms will appear in most of them. With max_df=0.95, you are telling the vectorizer to ignore any term that is present in more than 190 documents (200 * 0.95). In this case, nearly all terms are repeated across the documents, so they are all pruned and the vectorizer has no terms left to build the matrix.
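Case 2 can be reproduced with a small sketch, assuming 200 identical copies of an invented sentence as the corpus. Every term has a document frequency of 200, which is above the cut-off of 190, so fitting with max_df=0.95 fails, while max_df=1.0 keeps the terms.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: the same text repeated 200 times.
docs = ["the same report text repeated"] * 200

try:
    TfidfVectorizer(max_df=0.95).fit(docs)
    outcome = "fitted"
except ValueError as exc:
    # Every term appears in all 200 documents, more than 0.95 * 200 = 190.
    outcome = str(exc)
print(outcome)

# With max_df=1.0 nothing is discarded for being too frequent.
relaxed = TfidfVectorizer(max_df=1.0).fit(docs)
print(sorted(relaxed.vocabulary_))
```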

This is my thought on this topic.

answered Nov 06 '22 19:11 by pnv