I have 900 different text files loaded into my console, totaling about 3.5 million words. I'm running the document clustering algorithms seen here, and am running into issues with the TfidfVectorizer
function. Here's what I'm looking at:
from sklearn.feature_extraction.text import TfidfVectorizer

# define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.4, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))

store_matrix = {}
for key, value in speech_dict.items():
    tfidf_matrix = tfidf_vectorizer.fit_transform(value)  # fit the vectorizer to synopses
    store_matrix[key] = tfidf_matrix
This code runs until
ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df.
pops up. The error keeps appearing unless I raise max_df to 0.99 and lower min_df to 0.01. With those settings, it runs seemingly forever, since it's including essentially all 3.5 million terms.
How can I get around this?
My text files are stored in speech_dict, whose keys are the filenames and whose values are the text.
From the scikit-learn TfidfVectorizer documentation:
max_df : float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
min_df : float in range [0.0, 1.0] or int, default=1
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
Please check the data type of the variable totalvocab_stemmed_body. If it is a list, each element of the list is treated as a separate document.
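For example (toy strings invented for illustration), the number of documents the vectorizer sees equals the length of the list, which is what min_df and max_df proportions are measured against:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
# A list of 3 strings is treated as 3 documents,
# so the resulting matrix has 3 rows.
X = vec.fit_transform(["first doc", "second doc", "third doc"])
print(X.shape[0])  # number of documents
```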
Case 1: number of documents = 2,000,000, min_df=0.5.
If you have a large number of files (say 2 million) from very different domains, each containing only a few words, there is very little chance that any term appears in at least 1,000,000 (2,000,000 * 0.5) documents.
Case 2: number of documents = 200, max_df=0.95.
If you have a set of 200 highly repetitive files, you will see that most terms are present in most of the documents. With max_df=0.95, you are telling the vectorizer to discard any term present in more than 190 files. Since nearly every term is repeated across the set, the vectorizer won't be able to find any terms for the matrix.
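Case 2 is easy to reproduce. The corpus below is a deliberately extreme, invented example where every term appears in 100% of documents, so max_df=0.95 prunes the entire vocabulary and the vectorizer raises the exact error from the question:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# 200 identical documents: every term has a document frequency of 1.0
docs = ["government policy economy speech"] * 200

# max_df=0.95 discards terms present in more than 190 of the 200 docs,
# which here is every term, leaving nothing for the vocabulary.
vectorizer = TfidfVectorizer(max_df=0.95)
err = None
try:
    vectorizer.fit_transform(docs)
except ValueError as e:
    err = e
print(err)  # "After pruning, no terms remain. Try a lower min_df or a higher max_df."
```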
This is my thought on this topic.