
scikit-learn: get selected features when using SelectKBest within pipeline

I am trying to do feature selection as part of a scikit-learn pipeline in a multi-label scenario. My goal is to select the best k features, for some given k.

It might be simple, but I don't understand how to get the indices of the selected features in such a scenario.

In a regular (single-label) scenario I could do something like this:

from sklearn.feature_selection import SelectKBest, f_classif

anova_filter = SelectKBest(f_classif, k=10)
anova_filter.fit_transform(data.X, data.Y)
anova_filter.get_support()
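For reference, get_support also accepts indices=True, which returns the integer indices of the selected columns instead of a boolean mask:

anova_filter.get_support(indices=True)  # array of the k selected column indices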

But in a multilabel scenario my label matrix has dimensions #samples x #unique_labels, so fit and fit_transform raise the following exception: ValueError: bad input shape

which makes sense, because they expect labels of shape [#samples].
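A minimal sketch of the shape mismatch on toy data (the MultiLabelBinarizer labels and array sizes below are purely illustrative):

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_selection import SelectKBest, f_classif

X = np.random.rand(5, 20)                               # 5 samples, 20 features
Y = MultiLabelBinarizer().fit_transform(
    [{'a'}, {'a', 'b'}, {'b'}, {'c'}, {'a', 'c'}])      # shape (5, 3): samples x unique labels
try:
    SelectKBest(f_classif, k=10).fit(X, Y)              # f_classif expects y of shape (n_samples,)
except ValueError as e:
    print(e)                                            # e.g. "bad input shape (5, 3)" on older versions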

In the multilabel scenario, it makes sense to do something like this:

from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

clf = Pipeline([('f_classif', SelectKBest(f_classif, k=10)), ('svm', LinearSVC())])

multiclf = OneVsRestClassifier(clf, n_jobs=-1)

multiclf.fit(data.X, data.Y)

But then the object I get back is of type sklearn.multiclass.OneVsRestClassifier, which doesn't have a get_support method. How do I get at the trained SelectKBest model when it is used inside a pipeline?

asked Sep 12 '15 by Delli22


People also ask

What is SelectKBest feature selection?

Feature selection is a technique where we choose the features in our data that contribute most to the target variable. In other words, we choose the best predictors for the target variable; for supervised models, SelectKBest does this by keeping only the top-scoring features.

How does univariate feature selection work?

Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the transform method: SelectKBest removes all but the k highest-scoring features.
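As a small single-label illustration (using the bundled iris dataset; k=2 is arbitrary here):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)                  # 150 samples, 4 features
X_new = SelectKBest(f_classif, k=2).fit_transform(X, y)
print(X_new.shape)                                 # (150, 2): only the 2 best-scoring features remain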

What is the benefit of using the Scikit-learn pipeline utility for data preprocessing?

The Scikit-learn pipeline is a tool that chains all steps of the workflow together for a more streamlined procedure. The key benefit of building a pipeline is improved readability. Pipelines are able to execute a series of transformations with one call, allowing users to attain results with less code.
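For example, a selector and a classifier can be chained and fit with a single call (a generic sketch on synthetic data, unrelated to the question's dataset):

from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, n_features=50, random_state=0)
pipe = Pipeline([('select', SelectKBest(f_classif, k=10)),
                 ('clf', LinearSVC())])
pipe.fit(X, y)           # feature selection and the classifier are fit in one call
pipe.predict(X[:5])      # new data passes through the same selection before prediction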


1 Answer

The way you set it up, there will be one SelectKBest per class. Is that what you intended? You can get them via

multiclf.estimators_[i].named_steps['f_classif'].get_support()
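For example, to collect one mask per class (a sketch assuming multiclf has been fit as in the question):

# one fitted copy of the pipeline per class lives in multiclf.estimators_
masks = [est.named_steps['f_classif'].get_support()
         for est in multiclf.estimators_]
# masks[i] is a boolean array marking the k features chosen for the i-th class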

If you want one feature selection for all the OvR models, you can do

clf = Pipeline([('f_classif', SelectKBest(f_classif, k=10)),
                ('svm', OneVsRestClassifier(LinearSVC()))])

and get the single feature selection with

clf.named_steps['f_classif'].get_support()
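If you prefer column indices or feature names over a boolean mask (feature_names below is a hypothetical list you would provide yourself):

import numpy as np

mask = clf.named_steps['f_classif'].get_support()
selected_idx = np.where(mask)[0]   # indices of the k kept columns
# selected_names = [feature_names[i] for i in selected_idx]  # hypothetical name lookup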
answered Oct 18 '22 by Andreas Mueller