How to select only a few columns in a scikit-learn column selector pipeline?

I was reading the scikit-learn tutorial on the column transformer. The given example (https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html#sklearn.compose.make_column_selector) works, but when I tried to select only a few columns, it gave me an error.

MWE

import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector

df = sns.load_dataset('tips')
mycols = ['tip','sex']


ct = make_column_transformer(make_column_selector(pattern=mycols)
ct.fit_transform(df)

Required

I want only the selected columns in the output.

NOTE
Of course, I know I can do df[mycols]; I am looking for a scikit-learn pipeline example.

BhishanPoudel asked Jun 16 '20 19:06


2 Answers

If you don't mind using mlxtend, it has a built-in transformer for that.

Using mlxtend

from mlxtend.feature_selection import ColumnSelector

pipe = ColumnSelector(mycols)
pipe.fit_transform(df)

For sklearn >= 0.20

  • Reference: https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import seaborn as sns

df = sns.load_dataset('tips')
mycols = ['tip','sex']

pipeline = Pipeline([
    ("selector", ColumnTransformer([
        ("selector", "passthrough", mycols)
    ], remainder="drop"))
])

pipeline.fit_transform(df)

For sklearn < 0.20

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.columns]


pipeline = Pipeline([
    ('selector', FeatureSelector(columns=mycols)),
])

pipeline.fit_transform(df)[:5]
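
Because the custom selector is an ordinary transformer, it can be followed by other pipeline steps. A minimal sketch, assuming two numeric columns and a `StandardScaler` as an illustrative downstream step (the scaler, `df_small`, and `num_pipeline` are not part of the original answer):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class FeatureSelector(BaseEstimator, TransformerMixin):
    """Select a fixed list of DataFrame columns (same logic as above)."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X, y=None):
        return X[self.columns]


# Illustrative stand-in for a few rows of the tips dataset.
df_small = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
    "sex": ["Female", "Male", "Male"],
})

num_pipeline = Pipeline([
    ("selector", FeatureSelector(columns=["total_bill", "tip"])),
    ("scaler", StandardScaler()),  # hypothetical follow-up step
])

# The scaler only ever sees the two selected numeric columns.
scaled = num_pipeline.fit_transform(df_small)
print(scaled.shape)
```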
BhishanPoudel answered Oct 14 '22 01:10


I'm maybe a bit late, but you can also select columns using sklearn's ColumnTransformer() by setting the transformer to "passthrough" and remainder="drop":

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


pipe = Pipeline([
    ("selector", ColumnTransformer([
        ("selector", "passthrough", mycols)
    ], remainder="drop"))
])
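
The same passthrough idea also works with `make_column_selector` from the question. Its `pattern` argument is matched as a single regex against the column names, so passing a list is one likely cause of the error; `make_column_transformer` also expects `(transformer, columns)` tuples. A minimal sketch, assuming the list is joined into one anchored regex (the `pattern` variable and the inline `df_small` are illustrative):

```python
import re

import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer

# Illustrative stand-in for a few rows of the tips dataset.
df_small = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
    "sex": ["Female", "Male", "Male"],
    "smoker": ["No", "No", "No"],
})
mycols = ["tip", "sex"]

# Build one regex that matches exactly the wanted names: "^(?:tip|sex)$".
pattern = "^(?:" + "|".join(re.escape(c) for c in mycols) + ")$"
selector = make_column_selector(pattern=pattern)

# remainder defaults to "drop", so unmatched columns are discarded.
ct = make_column_transformer(("passthrough", selector))
result = ct.fit_transform(df_small)

print(selector(df_small))  # names of the columns the selector picked
```

The selected columns come back in the DataFrame's own column order, not the order of `mycols`.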
Jens answered Oct 13 '22 23:10