 

Scikit-Learn Vectorizer `max_features`

Tags:

scikit-learn

How do I choose the number of the max_features parameter in TfidfVectorizer module? Should I use the maximum number of elements in the data?

The description of the parameter does not give me a clear vision of how to choose the value for it:

max_features : int or None, default=None

If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.

asked Sep 08 '17 by pierre


1 Answer

This parameter is entirely optional and should be calibrated by reasoning about your data, not by a fixed rule.

Sometimes it is not effective to keep the whole vocabulary: the data may contain exceptionally rare words which, if passed to TfidfVectorizer().fit(), add unwanted dimensions to future inputs. One appropriate technique in this case would be to print the word frequencies across documents and set a threshold on them. Imagine your corpus consists of 100 distinct words and you set a threshold of 50. After looking at the word frequencies, 20 words occur fewer than 50 times. Thus, you set max_features=80 and you are good to go.
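As a rough sketch, you can count word frequencies with plain Python before deciding on a threshold (the variable names and the threshold value here are just illustrative):

```python
from collections import Counter

docs = ['gpu processor cpu performance',
        'gpu performance ram computer',
        'cpu computer ram processor jeans']

# Count total occurrences of each whitespace-separated token across the corpus
counts = Counter(token for doc in docs for token in doc.split())

threshold = 2  # keep only words that occur at least this often
kept = [word for word, count in counts.items() if count >= threshold]
print(len(kept))  # a candidate value for max_features
```

Note this uses simple whitespace splitting; TfidfVectorizer applies its own tokenization and lowercasing, so the counts can differ slightly on real text.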

If max_features is set to None, the whole vocabulary is considered during the TF-IDF transformation. Otherwise, if you pass, say, 5 to max_features, the feature matrix is built from the 5 most frequent words across the text documents.


Quick example

Assume you work with hardware-related documents. Your raw data is the following:

from sklearn.feature_extraction.text import TfidfVectorizer

data = ['gpu processor cpu performance',
        'gpu performance ram computer',
        'cpu computer ram processor jeans']

You see the word jeans in the third document is hardly related to the topic and occurs only once in the whole dataset. The best way to omit such a word would of course be the stop_words parameter, but imagine there are plenty of such words, or words that are related to the topic yet occur rarely. In the second case, the max_features parameter might help. If you proceed with max_features=None, it will create a 3x7 sparse matrix, while the best-case scenario would be a 3x6 matrix:

tf = TfidfVectorizer(max_features=None).fit(data)
len(tf.vocabulary_)     # 7, as the corpus has 7 distinct words
tf.fit_transform(data)  # returns a 3x7 sparse matrix

tf = TfidfVectorizer(max_features=6).fit(data)  # excludes 'jeans'
tf.vocabulary_          # every word except 'jeans'
len(tf.vocabulary_)     # 6
tf.fit_transform(data)  # returns a 3x6 sparse matrix
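To double-check which word was cut off, you can fit two vectorizers and diff their vocabularies (a small sketch on the same toy data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

data = ['gpu processor cpu performance',
        'gpu performance ram computer',
        'cpu computer ram processor jeans']

full = TfidfVectorizer(max_features=None).fit(data)
trimmed = TfidfVectorizer(max_features=6).fit(data)

# The vocabulary difference reveals what max_features dropped
dropped = set(full.vocabulary_) - set(trimmed.vocabulary_)
print(dropped)  # {'jeans'}
```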
answered Sep 21 '22 by E.Z.