I want to use sklearn.preprocessing.StandardScaler on a subset of pandas dataframe columns. Outside a pipeline this is trivial:
df[['A', 'B']] = scaler.fit_transform(df[['A', 'B']])
But now assume I have a column 'C' in df of type string, and the following pipeline definition:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('standard', StandardScaler())
])
df_scaled = pipeline.fit_transform(df)
How can I tell StandardScaler to only scale columns A and B?
I'm used to SparkML pipelines where the features to be scaled can be passed to the constructor of the scaler component:
normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)
Note: the features column contains a sparse vector with all the numerical feature columns, created by Spark's VectorAssembler.
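For reference, a rough PySpark sketch of that pattern (the column names "A" and "B" are just placeholders for the numerical feature columns):
from pyspark.ml import Pipeline
from pyspark.ml.feature import Normalizer, VectorAssembler

# Assemble the numeric columns into one vector column, then normalize that vector.
assembler = VectorAssembler(inputCols=["A", "B"], outputCol="features")
normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)
spark_pipeline = Pipeline(stages=[assembler, normalizer])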
You could check out sklearn-pandas, which offers an integration of pandas DataFrames and sklearn, e.g. with the DataFrameMapper:
mapper = DataFrameMapper([
    (list_of_columnnames, StandardScaler())
])
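Spelled out for your example, a sketch might look like the following (assuming sklearn-pandas is installed; if I recall its API correctly, default=None passes the remaining columns such as 'C' through untouched and df_out=True returns a DataFrame instead of a numpy array):
from sklearn.preprocessing import StandardScaler
from sklearn_pandas import DataFrameMapper

# Scale only A and B; every other column (e.g. the string column C)
# is passed through unchanged because of default=None.
mapper = DataFrameMapper(
    [(['A', 'B'], StandardScaler())],
    default=None,
    df_out=True,
)
df_scaled = mapper.fit_transform(df)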
If you don't want an external dependency, you can use a simple custom transformer, as I answered here:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline

class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        # select only the requested columns from the incoming DataFrame
        return X[self.names]

pipe = make_pipeline(Columns(names=list_of_columnnames), StandardScaler())
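If you also want to keep the untouched string column 'C' next to the scaled values, one way (a sketch, assuming a frame like the one below) is to combine two such selectors with a FeatureUnion:
import pandas as pd
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                   'B': [10.0, 20.0, 30.0],
                   'C': ['x', 'y', 'z']})

# Scale A and B, pass C through as-is, then concatenate column-wise.
union = make_union(
    make_pipeline(Columns(names=['A', 'B']), StandardScaler()),
    Columns(names=['C']),
)
scaled = union.fit_transform(df)  # numpy array: scaled A, scaled B, raw C (object dtype)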