I want to use sklearn.preprocessing.StandardScaler on a subset of pandas dataframe columns. Outside a pipeline this is trivial:
df[['A', 'B']] = scaler.fit_transform(df[['A', 'B']])
But now assume I have a column 'C' in df of type string, and the following pipeline definition:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('standard', StandardScaler())
])
df_scaled = pipeline.fit_transform(df)
How can I tell StandardScaler to only scale columns A and B?
I'm used to SparkML pipelines where the features to be scaled can be passed to the constructor of the scaler component:
normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)
Note: the features column contains a sparse vector with all the numerical feature columns, created by Spark's VectorAssembler.
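For reference, a rough PySpark sketch of that pattern (the column names "A" and "B" are just placeholders for the numerical feature columns):
from pyspark.ml import Pipeline
from pyspark.ml.feature import Normalizer, VectorAssembler

# Assemble the numeric columns into one vector column, then normalize that vector.
assembler = VectorAssembler(inputCols=["A", "B"], outputCol="features")
normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)
spark_pipeline = Pipeline(stages=[assembler, normalizer])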
You could check out sklearn-pandas, which offers an integration of pandas DataFrames and sklearn, e.g. with the DataFrameMapper:
mapper = DataFrameMapper([
    (list_of_columnnames, StandardScaler())
])
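Spelled out for your example, a sketch might look like the following (assuming sklearn-pandas is installed; if I recall its API correctly, default=None passes the remaining columns such as 'C' through untouched and df_out=True returns a DataFrame instead of a numpy array):
from sklearn.preprocessing import StandardScaler
from sklearn_pandas import DataFrameMapper

# Scale only A and B; every other column (e.g. the string column C)
# is passed through unchanged because of default=None.
mapper = DataFrameMapper(
    [(['A', 'B'], StandardScaler())],
    default=None,
    df_out=True,
)
df_scaled = mapper.fit_transform(df)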
If you don't want an external dependency, you can use a simple custom transformer, as I answered here:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline

class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        # select only the requested columns from the incoming DataFrame
        return X[self.names]

pipe = make_pipeline(Columns(names=list_of_columnnames), StandardScaler())
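If you also want to keep the untouched string column 'C' next to the scaled values, one way (a sketch, assuming a frame like the one below) is to combine two such selectors with a FeatureUnion:
import pandas as pd
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                   'B': [10.0, 20.0, 30.0],
                   'C': ['x', 'y', 'z']})

# Scale A and B, pass C through as-is, then concatenate column-wise.
union = make_union(
    make_pipeline(Columns(names=['A', 'B']), StandardScaler()),
    Columns(names=['C']),
)
scaled = union.fit_transform(df)  # numpy array: scaled A, scaled B, raw C (object dtype)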