
SelectKBest based on (estimated) number of features

I'm trying to implement a hierarchical text classifier with scikit-learn, with one "root" classifier that assigns each input string to one (or more) of ~50 categories. For each of these categories, I'm going to train a new classifier, which solves the actual task.

The reason for this two-layer approach is training performance and memory usage (a single classifier that has to separate >1k classes does not perform very well...).

This is what my pipeline looks like for each of these "subclassifiers":

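from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer(strip_accents=None, lowercase=True, analyzer='char_wb', ngram_range=(3,8), max_df=0.1)),
    ('tfidf', TfidfTransformer(norm='l2')),
    ('feat', SelectKBest(chi2, k=10000)),
    ('clf', OneVsRestClassifier(SGDClassifier(loss='log', penalty='elasticnet', alpha=0.0001, n_iter=10))),
])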

Now to my problem: I'm using SelectKBest to limit the model size to a reasonable amount, but some subclassifiers don't have enough input data to even reach the 10k feature limit, which causes the following error:

(...)
  File "/usr/local/lib/python3.4/dist-packages/sklearn/feature_selection/univariate_selection.py", line 300, in fit
    self._check_params(X, y)
  File "/usr/local/lib/python3.4/dist-packages/sklearn/feature_selection/univariate_selection.py", line 405, in _check_params
    % self.k)
ValueError: k should be >=0, <= n_features; got 10000.Use k='all' to return all features.

I don't know how many features I will have without applying the CountVectorizer first, but I have to define the pipeline in advance. My preferred solution would be to skip the SelectKBest step if there are fewer than k features anyway, but I don't know how to implement this behaviour without calling CountVectorizer twice (once in advance, once as part of the pipeline).

Any thoughts on this?

klamann asked Dec 25 '22 21:12


1 Answer

I followed the advice of Martin Krämer and created a subclass of SelectKBest which implements the desired functionality:

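from sklearn.feature_selection import SelectKBest


class SelectAtMostKBest(SelectKBest):
    """SelectKBest that keeps all features when fewer than k are available."""

    def _check_params(self, X, y):
        if not (self.k == "all" or 0 <= self.k <= X.shape[1]):
            # set k to "all" (skip feature selection) if fewer than k features are available
            self.k = "all"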
I tried to add this snippet to his answer, but the edit request was rejected, so here it is.
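Note that the subclass above overrides `_check_params`, an underscore-prefixed (private) method, and changes `self.k` during fitting. As a rough alternative sketch that relies only on public API, a small wrapper could cap k at the number of columns it actually sees in `fit`. The class name `CappedSelectKBest` and the toy corpus are made up for illustration; the vectorizer settings are a simplified version of the pipeline in the question.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline


class CappedSelectKBest(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: select at most k features, never more than exist."""

    def __init__(self, score_func=chi2, k=10):
        self.score_func = score_func
        self.k = k

    def fit(self, X, y):
        # Cap k at the number of columns produced by the previous step,
        # leaving the user-supplied hyperparameter self.k untouched.
        k = min(self.k, X.shape[1])
        self.selector_ = SelectKBest(self.score_func, k=k).fit(X, y)
        return self

    def transform(self, X):
        return self.selector_.transform(X)


# Toy corpus (invented): it yields far fewer than 10000 character n-grams.
docs = ["red apple", "green apple", "red car", "green car"]
labels = [0, 0, 1, 1]

pipeline = Pipeline([
    ("vect", CountVectorizer(analyzer="char_wb", ngram_range=(3, 8))),
    ("feat", CappedSelectKBest(chi2, k=10000)),
])
features = pipeline.fit_transform(docs, labels)
print(features.shape)  # every available n-gram feature is kept
```

Because the cap is computed inside `fit`, the vectorizer runs only once as part of the pipeline, and the same pipeline definition can be reused for subclassifiers with very different amounts of training data.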

klamann answered Dec 27 '22 12:12