NLTK Document clustering: no terms remain after pruning?

I have 900 different text files loaded into my console, totaling about 3.5 million words. I'm running the document clustering algorithms seen here, and am running into issues with the TfidfVectorizer function. Here's what I'm looking at:

from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.4, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1,3))

store_matrix = {}
for key,value in speech_dict.items():
    tfidf_matrix = tfidf_vectorizer.fit_transform(value) #fit the vectorizer to synopses
    store_matrix[key] = tfidf_matrix

This code runs until ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df. is raised. The only way I can avoid the error is to raise max_df to 0.99 and lower min_df to 0.01, but then it runs seemingly forever, since it's including essentially all 3.5 million terms.

How can I get around this?

My text files are stored in speech_dict, the keys of which are the filenames, and the values of which is the text.

blacksite asked Oct 31 '22 15:10

1 Answer

From the scikit-learn documentation for the TF-IDF vectorizer:

max_df : float in range [0.0, 1.0] or int, default=1.0

When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

min_df : float in range [0.0, 1.0] or int, default=1

When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

Please check the data type of the variable totalvocab_stemmed_body. If it is a list, each element of the list is treated as a document.

Case 1: number of documents = 2,000,000, min_df=0.5.

If you have a large number of files (say 2 million), each containing only a few words, and they come from very different domains, there is very little chance that any term is present in at least 1,000,000 (2,000,000 * 0.5) documents.

Case 2: number of documents = 200, max_df=0.95

If you have a set of repeated files (say 200), the same terms will be present in most of the documents. With max_df=0.95, you are telling the vectorizer to discard any term that appears in more than 190 files. In this case, nearly every term is repeated across the corpus, so the vectorizer cannot find any terms to keep for the matrix.
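Case 2 can be reproduced in a few lines. This sketch (with made-up documents) shows that when every term appears in essentially all documents, max_df=0.95 prunes the entire vocabulary and fit_transform raises the same ValueError as in the question:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# 200 identical "files": every term has a document frequency of 1.0,
# which is strictly higher than max_df=0.95, so all terms are pruned.
docs = ["government policy speech"] * 200

vectorizer = TfidfVectorizer(max_df=0.95)
try:
    vectorizer.fit_transform(docs)
except ValueError as err:
    print(err)  # "After pruning, no terms remain. ..."
```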

This is my thought on this topic.
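Note also that min_df and max_df are corpus-level statistics: they only make sense when one vectorizer is fit over many documents at once. The question's loop calls fit_transform separately per file, so each fit sees a tiny corpus. A minimal sketch of fitting a single vectorizer over all files, assuming (as the question states) that speech_dict maps filenames to raw text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the question's speech_dict (filename -> raw text).
speech_dict = {
    "speech_a.txt": "the economy is growing and jobs are growing",
    "speech_b.txt": "the war effort demands jobs and sacrifice",
    "speech_c.txt": "the economy and the war shape this election",
}

filenames = list(speech_dict.keys())
corpus = [speech_dict[name] for name in filenames]

# Fit ONE vectorizer over the whole corpus so document frequencies are
# meaningful; each row of the resulting matrix corresponds to one file.
vectorizer = TfidfVectorizer(max_df=0.8, min_df=0.4, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(corpus)
print(tfidf_matrix.shape)  # one row per file
```

The row order of tfidf_matrix matches filenames, so per-file results can still be looked up by name without fitting a separate vectorizer for each file.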

pnv answered Nov 09 '22 15:11