I am trying to do feature selection as part of a scikit-learn pipeline in a multi-label scenario. My goal is to select the best K features for some given K.
It might be simple, but I don't understand how to get the indices of the selected features in such a scenario.
In a regular (single-label) scenario I could do something like this:
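from sklearn.feature_selection import SelectKBest, f_classif
anova_filter = SelectKBest(f_classif, k=10)
anova_filter.fit_transform(data.X, data.Y)
anova_filter.get_support()  # boolean mask of the selected features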
But in a multilabel scenario my labels have dimensions #samples x #unique_labels, so fit and fit_transform raise the following exception: ValueError: bad input shape
That makes sense, because it expects labels of dimension [#samples].
In the multilabel scenario, it makes sense to do something like this:
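from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
clf = Pipeline([('f_classif', SelectKBest(f_classif, k=10)), ('svm', LinearSVC())])
multiclf = OneVsRestClassifier(clf, n_jobs=-1)
multiclf.fit(data.X, data.Y)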
But then the object I get is of type sklearn.multiclass.OneVsRestClassifier, which doesn't have a get_support method. How do I get the trained SelectKBest model when it is used inside a pipeline?
Feature selection for supervised models using SelectKBest
Feature selection is a technique where we choose the features in our data that contribute most to the target variable. In other words, we choose the best predictors for the target variable.
Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the transform method: SelectKBest removes all but the highest scoring features.
The Scikit-learn pipeline is a tool that chains all steps of the workflow together for a more streamlined procedure. The key benefit of building a pipeline is improved readability. Pipelines are able to execute a series of transformations with one call, allowing users to attain results with less code.
The way you set it up, there will be one SelectKBest per class. Is that what you intended? You can get them via
multiclf.estimators_[i].named_steps['f_classif'].get_support()
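To make the per-class lookup concrete, here is a minimal sketch on synthetic data. The arrays X and Y and the names supports and selected are made up for illustration (the original data.X and data.Y are not shown), and n_jobs is left at its default to keep the example simple.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in for data.X / data.Y (assumption): 200 samples,
# 30 features, 4 labels encoded as a binary indicator matrix.
rng = np.random.RandomState(0)
X = rng.randn(200, 30)
Y = np.column_stack(
    [(X[:, j] + 0.5 * rng.randn(200) > 0).astype(int) for j in range(4)]
)

clf = Pipeline([('f_classif', SelectKBest(f_classif, k=10)),
                ('svm', LinearSVC())])
multiclf = OneVsRestClassifier(clf)
multiclf.fit(X, Y)

# estimators_ holds one fitted copy of the pipeline per label column,
# so each label has its own SelectKBest and its own boolean mask.
supports = [est.named_steps['f_classif'].get_support()
            for est in multiclf.estimators_]

# Convert each boolean mask into the indices of the selected features.
selected = [np.flatnonzero(mask) for mask in supports]
```

Each entry of selected then holds the 10 feature indices chosen for the corresponding label, and the sets can differ from label to label.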
If you want one feature selection for all the OvR models, you can do
clf = Pipeline([('f_classif', SelectKBest(f_classif, k=10)),
                ('svm', OneVsRestClassifier(LinearSVC()))])
and get the single feature selection with
clf.named_steps['f_classif'].get_support()
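One caveat, as far as I can tell: f_classif itself checks that y is one-dimensional, so fitting this single-selector pipeline directly on an indicator matrix may hit the same ValueError as in the question. A possible workaround, sketched below, is to pass SelectKBest a score function that accepts a 2-D Y. The helper mean_f_classif is hypothetical (not part of scikit-learn) and simply averages the per-label F-scores; the synthetic X and Y are likewise assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def mean_f_classif(X, Y):
    """Hypothetical helper: average each feature's ANOVA F-score
    over all label columns of the indicator matrix Y."""
    per_label = [f_classif(X, Y[:, j])[0] for j in range(Y.shape[1])]
    return np.mean(per_label, axis=0)


# Synthetic stand-in for data.X / data.Y (assumption).
rng = np.random.RandomState(0)
X = rng.randn(200, 30)
Y = np.column_stack(
    [(X[:, j] + 0.5 * rng.randn(200) > 0).astype(int) for j in range(4)]
)

# A single SelectKBest shared by all one-vs-rest classifiers.
clf = Pipeline([('f_classif', SelectKBest(mean_f_classif, k=10)),
                ('svm', OneVsRestClassifier(LinearSVC()))])
clf.fit(X, Y)

mask = clf.named_steps['f_classif'].get_support()
selected = np.flatnonzero(mask)  # indices of the 10 shared features
```

Averaging is only one way to combine the per-label scores; taking the maximum over labels would instead favour features that are strongly predictive of at least one label.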