How do I choose the value of the max_features parameter in TfidfVectorizer? Should I use the maximum number of elements in the data?
The description of the parameter does not give me a clear idea of how to choose a value for it:
max_features : int or None, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.
This parameter is optional and should be tuned to your data and your task.
It is not always useful to keep the whole vocabulary: the data may contain exceptionally rare words which, if passed to TfidfVectorizer().fit(), add unwanted dimensions to the feature matrix. One reasonable technique is to print out the word frequencies across the documents and set a threshold on them. Suppose you set a threshold of 50 and your corpus has a vocabulary of 100 distinct words. If 20 of those words occur fewer than 50 times, you set max_features=80 and you are good to go.
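The thresholding idea can be sketched roughly as follows. This is only an illustration: the corpus, the threshold of 2, and the names docs, term_totals and n_keep are made up for the example, and it assumes CountVectorizer and TfidfVectorizer are used with their default tokenization.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, chosen so that one word ("typo") is clearly rare.
docs = [
    "disk memory cache disk",
    "memory cache disk network",
    "network cache memory typo",
]

# Count how many times each term occurs across the whole corpus.
counter = CountVectorizer()
counts = counter.fit_transform(docs)
totals = np.asarray(counts.sum(axis=0)).ravel()

# Map each term to its total count so the frequencies can be inspected.
term_totals = {term: int(totals[idx]) for term, idx in counter.vocabulary_.items()}
for term, total in sorted(term_totals.items(), key=lambda item: -item[1]):
    print(term, total)

# Assumed threshold: keep only terms that occur at least this many times.
threshold = 2
n_keep = sum(1 for total in term_totals.values() if total >= threshold)

# Use the number of surviving terms as max_features.
tfidf = TfidfVectorizer(max_features=n_keep).fit(docs)
print(n_keep, sorted(tfidf.vocabulary_))
```

Note that this works cleanly here because no terms tie in frequency right at the cutoff; if several terms had the same count at the boundary, which of them are kept would depend on how the vectorizer orders them.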
If max_features is set to None, the whole vocabulary is used during the TF-IDF transformation. Otherwise, if you pass, say, 5 to max_features, the feature matrix is built from only the 5 most frequent words across the text documents.
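As a small sketch of how the selection works, the snippet below fits one vectorizer without a limit and one with max_features=2 on a made-up corpus (the fruit sentences and variable names are purely illustrative) and compares the resulting vocabularies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "apple" occurs 5 times, "pear" 3 times,
# "banana" and "kiwi" once each.
docs = [
    "apple apple apple pear",
    "apple pear pear banana",
    "apple kiwi",
]

# No limit: every distinct term becomes a feature.
full = TfidfVectorizer(max_features=None).fit(docs)
print(sorted(full.vocabulary_))  # ['apple', 'banana', 'kiwi', 'pear']

# Limit of 2: only the two most frequent terms are kept.
top2 = TfidfVectorizer(max_features=2).fit(docs)
print(sorted(top2.vocabulary_))  # ['apple', 'pear']
print(top2.transform(docs).shape)  # (3, 2)
```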
Assume you work with hardware-related documents. Your raw data is the following:
from sklearn.feature_extraction.text import TfidfVectorizer
data = ['gpu processor cpu performance',
'gpu performance ram computer',
'cpu computer ram processor jeans']
You can see that the word jeans in the third document is hardly related to the topic and occurs only once in the whole dataset. The best way to omit the word, of course, would be to use the stop_words parameter, but imagine there are plenty of such words, or words that are related to the topic but occur rarely. In such cases, the max_features parameter might help. If you proceed with max_features=None, it will create a 3x7 sparse matrix, while the best-case scenario would be a 3x6 matrix:
# Keep the full vocabulary
tf = TfidfVectorizer(max_features=None).fit(data)
len(tf.vocabulary_)  # returns 7, as the data contains 7 distinct words
tf.transform(data)  # returns a 3x7 sparse matrix
# Keep only the 6 most frequent words, which excludes 'jeans'
tf = TfidfVectorizer(max_features=6).fit(data)
tf.vocabulary_  # maps every word except 'jeans' to a column index
len(tf.vocabulary_)  # returns 6
tf.transform(data)  # returns a 3x6 sparse matrix