SelectKBest based on (estimated) number of features

Question

I'm trying to implement a hierarchical text classifier with scikit-learn, with one "root" classifier that assigns each input string to one (or more) of ~50 categories. For each of these categories, I then train a separate classifier, which solves the actual task.

The reason for this two-layer approach is training performance and memory issues (a classifier which is supposed to separate >1k classes does not perform very well...).

This is what my pipeline looks like for each of these "subclassifiers":

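from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer(strip_accents=None, lowercase=True, analyzer='char_wb', ngram_range=(3,8), max_df=0.1)),
    ('tfidf', TfidfTransformer(norm='l2')),
    ('feat', SelectKBest(chi2, k=10000)),
    ('clf', OneVsRestClassifier(SGDClassifier(loss='log', penalty='elasticnet', alpha=0.0001, n_iter=10))),
])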

Now to my problem: I'm using SelectKBest to keep the model size reasonable, but for some of the subclassifiers there is so little input data that the vectorizer produces fewer than 10k features, and fitting fails with:

(...)
  File "/usr/local/lib/python3.4/dist-packages/sklearn/feature_selection/univariate_selection.py", line 300, in fit
    self._check_params(X, y)
  File "/usr/local/lib/python3.4/dist-packages/sklearn/feature_selection/univariate_selection.py", line 405, in _check_params
    % self.k)
ValueError: k should be >=0, <= n_features; got 10000.Use k='all' to return all features.

I don't know how many features I will have until CountVectorizer has been applied, but I have to define the pipeline in advance. My preferred solution would be to skip the SelectKBest step if there are fewer than k features anyway, but I don't know how to implement this behaviour without calling CountVectorizer twice (once in advance, once as part of the pipeline).

Any thoughts on this?

klamann · Accepted Answer

I followed the advice of Martin Krämer and created a subclass of SelectKBest which implements the desired functionality:

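from sklearn.feature_selection import SelectKBest

class SelectAtMostKBest(SelectKBest):

    def _check_params(self, X, y):
        # _check_params is a private hook of SelectKBest that is called during fit
        if not (self.k == "all" or 0 <= self.k <= X.shape[1]):
            # set k to "all" (skip feature selection) if fewer than k features are available
            # note: this overwrites the k passed to the constructor
            self.k = "all"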

I tried to add this snippet to his answer but the request was rejected, so there you are...
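Since the subclass above overrides a private method, it may behave differently across scikit-learn versions. As a rough alternative sketch that uses only public API, the selector can be wrapped in a small transformer that caps k at fit time. The class name CappedKBest and the toy corpus below are made up for illustration and are not part of scikit-learn or of the original question:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline


class CappedKBest(BaseEstimator, TransformerMixin):
    """Hypothetical helper: keep the min(k, n_features) best features."""

    def __init__(self, score_func=chi2, k=10000):
        self.score_func = score_func
        self.k = k

    def fit(self, X, y):
        # Cap k at the number of columns actually produced upstream.
        # self.k itself is left untouched, so the estimator can be cloned.
        effective_k = min(self.k, X.shape[1])
        self.selector_ = SelectKBest(self.score_func, k=effective_k).fit(X, y)
        return self

    def transform(self, X):
        return self.selector_.transform(X)


# Tiny made-up corpus, far too small to yield 10000 character n-grams.
docs = ["apple pie", "apple tart", "car engine", "car wheel"]
labels = [0, 0, 1, 1]

features = Pipeline([
    ('vect', CountVectorizer(analyzer='char_wb', ngram_range=(3, 8))),
    ('tfidf', TfidfTransformer(norm='l2')),
    ('feat', CappedKBest(chi2, k=10000)),
])
X_small = features.fit_transform(docs, labels)
n_available = len(features.named_steps['vect'].vocabulary_)
```

Here X_small keeps all n_available columns because the cap kicks in. With a k smaller than the vocabulary size, the wrapper behaves like a plain SelectKBest. A classifier step can be appended to the pipeline as in the question.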
