Using scikit StandardScaler in Pipeline on a subset of Pandas dataframe columns

I want to use sklearn.preprocessing.StandardScaler on a subset of pandas dataframe columns. Outside a pipeline this is trivial:

df[['A', 'B']] = scaler.fit_transform(df[['A', 'B']])
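For concreteness, a minimal runnable version of that (the toy data is mine; the column names follow the question):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                   'B': [10.0, 20.0, 30.0],
                   'C': ['x', 'y', 'z']})

scaler = StandardScaler()
df[['A', 'B']] = scaler.fit_transform(df[['A', 'B']])  # 'C' stays untouched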

But now assume df has a string-typed column 'C' and the following pipeline definition:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('standard', StandardScaler())
])

df_scaled = pipeline.fit_transform(df)

How can I tell StandardScaler to only scale columns A and B?

I'm used to SparkML pipelines where the features to be scaled can be passed to the constructor of the scaler component:

normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)

Note: the features column contains a sparse vector with all the numerical feature columns, created by Spark's VectorAssembler.
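For comparison, a sketch of how that fits together on the Spark side (assuming a Spark DataFrame spark_df with numeric columns A and B; the names are illustrative):

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, Normalizer

# assemble the numeric columns into a single vector column
assembler = VectorAssembler(inputCols=["A", "B"], outputCol="features")
normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)

model = Pipeline(stages=[assembler, normalizer]).fit(spark_df)
result = model.transform(spark_df)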

asked by Romeo Kienzler

1 Answer

You could check out sklearn-pandas, which integrates pandas DataFrames with sklearn, e.g. via its DataFrameMapper:

from sklearn_pandas import DataFrameMapper  # pip install sklearn-pandas

mapper = DataFrameMapper([
    (list_of_columnnames, StandardScaler())
])
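Usage would then look like this (a sketch, assuming list_of_columnnames = ['A', 'B'] from the question; by default the mapper returns only the mapped columns, while passing default=None keeps the remaining columns unchanged):

scaled = mapper.fit_transform(df)  # NumPy array with standardized 'A' and 'B'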

If you don't want external dependencies, you can use a simple custom transformer, as I answered here:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

class Columns(BaseEstimator, TransformerMixin):
    """Select a subset of columns from a pandas DataFrame."""
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        # Stateless selector: nothing to learn
        return self

    def transform(self, X):
        return X[self.names]

pipe = make_pipeline(Columns(names=list_of_columnnames), StandardScaler())
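With list_of_columnnames = ['A', 'B'] (my assumption, matching the question), usage looks like:

df_scaled = pipe.fit_transform(df)  # NumPy array with standardized 'A' and 'B'; 'C' is dropped

If you need several differently transformed column groups side by side, selectors like this are commonly combined in a FeatureUnion.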
answered by Marcus V.