I'm trying to implement a hierarchical text classifier with scikit-learn, with one "root" classifier that sorts every input string into one (or more) of ~50 categories. For each of these categories, I'm going to train a separate classifier, which solves the actual task.
The reason for this two-layer approach is training performance and memory issues: a single classifier that is supposed to separate more than 1,000 classes does not perform very well...
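Conceptually, prediction is a two-stage dispatch; here is a minimal sketch (root_clf and sub_pipelines are placeholder names for the fitted root model and a dict of fitted per-category pipelines, not anything from scikit-learn itself):

def predict_two_stage(texts, root_clf, sub_pipelines):
    # stage 1: the root classifier routes each text to one of the ~50 categories
    categories = root_clf.predict(texts)
    # stage 2: that category's own pipeline produces the final fine-grained label
    return [sub_pipelines[cat].predict([text])[0]
            for text, cat in zip(texts, categories)]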
This is what my pipeline looks like for each of these "subclassifiers":
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier

pipeline = Pipeline([
    ('vect', CountVectorizer(strip_accents=None, lowercase=True, analyzer='char_wb', ngram_range=(3, 8), max_df=0.1)),
    ('tfidf', TfidfTransformer(norm='l2')),
    ('feat', SelectKBest(chi2, k=10000)),  # keep the 10k best features by chi2 score
    ('clf', OneVsRestClassifier(SGDClassifier(loss='log', penalty='elasticnet', alpha=0.0001, n_iter=10))),  # n_iter and loss='log' are the old names; newer scikit-learn uses max_iter and loss='log_loss'
])
Now to my problem: I'm using SelectKBest to limit the model size to a reasonable amount, but for the subclassifiers there is sometimes not enough input data available, so I don't even reach the 10k feature limit, which causes:
(...)
File "/usr/local/lib/python3.4/dist-packages/sklearn/feature_selection/univariate_selection.py", line 300, in fit
self._check_params(X, y)
File "/usr/local/lib/python3.4/dist-packages/sklearn/feature_selection/univariate_selection.py", line 405, in _check_params
% self.k)
ValueError: k should be >=0, <= n_features; got 10000.Use k='all' to return all features.
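For what it's worth, this is easy to reproduce with any made-up subset whose n-gram vocabulary stays below k (toy data, illustrative only: the shared words are pruned by max_df=0.1, leaving only a few dozen features from the unique numbers):

train_texts = ["sample text number %d" % i for i in range(20)]
train_labels = ["even" if i % 2 == 0 else "odd" for i in range(20)]

try:
    pipeline.fit(train_texts, train_labels)
except ValueError as e:
    print(e)  # k should be >=0, <= n_features; got 10000. ...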
I don't know how many features I will have without applying the CountVectorizer, but I have to define the pipeline in advance. My preferred solution would be to skip the SelectKBest step if there are fewer than k features anyway, but I don't know how to implement this behaviour without calling CountVectorizer twice (once in advance, once as part of the pipeline).
Any thoughts on this?
I followed the advice of Martin Krämer and created a subclass of SelectKBest which implements the desired functionality:
from sklearn.feature_selection import SelectKBest

class SelectAtMostKBest(SelectKBest):
    def _check_params(self, X, y):
        if not (self.k == "all" or 0 <= self.k <= X.shape[1]):
            # set k to "all" (i.e. skip feature selection) if fewer than k features are available
            self.k = "all"
I tried to add this snippet to his answer, but the suggested edit was rejected, so there you are...
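For completeness, swapping it into the pipeline from above is then a one-line change:

pipeline = Pipeline([
    ('vect', CountVectorizer(strip_accents=None, lowercase=True, analyzer='char_wb', ngram_range=(3, 8), max_df=0.1)),
    ('tfidf', TfidfTransformer(norm='l2')),
    ('feat', SelectAtMostKBest(chi2, k=10000)),  # falls back to k="all" when fewer than 10k features exist
    ('clf', OneVsRestClassifier(SGDClassifier(loss='log', penalty='elasticnet', alpha=0.0001, n_iter=10))),
])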