I am working with scikit-learn and looking for a transformer that lets me simply select which columns to keep or which columns to drop.
In practice, I would like to add a transformer step to my pipeline that lets me choose which columns to keep or drop. I am aware that in the example below I could simply use the remainder option, but that would not work in my real implementation, where I need to parametrize the column selection so that it can easily be applied to the training data, the test data and eventually the scoring data.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn import preprocessing

prep_pipeline = ColumnTransformer(
    transformers=[("std_num", preprocessing.StandardScaler(), ["a", "b"])],
    remainder="passthrough",
)
X = pd.DataFrame([[0., 1., 2., 2.],
                  [1., 1., 0., 1.]])
X.columns = ["a", "b", "c", "d"]
prep_pipeline.fit_transform(X)
The solution I need would pipe in an additional transformer step whose only role is to select the columns ["a", "d"], so the expected output is:
array([[-1.,  1.],
       [ 1., -1.]])
I think you should use a scikit-learn Pipeline and add the following class as a step in it (StandardScaler on its own does not support scaling only part of a DataFrame):
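import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DropSomeColumns(BaseEstimator, TransformerMixin):
    # Note: despite its name, this transformer keeps the columns listed in `cols`.
    def __init__(self, cols):
        if not isinstance(cols, list):
            self.cols = [cols]
        else:
            self.cols = cols

    def fit(self, X: pd.DataFrame, y=None):
        # there is nothing to fit; y defaults to None so the step also works without labels
        return self

    def transform(self, X: pd.DataFrame):
        # indexing with a list of labels returns a new DataFrame holding only those columns
        return X[self.cols].copy()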
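To show how such a step could be wired in, here is a minimal sketch rather than a definitive implementation. `ColumnSelector` is a hypothetical name for a variant of the class above, written so that `fit` can be called without `y`. Putting the selector before a plain `StandardScaler` and scaling both kept columns are my assumptions, chosen because that reproduces the expected output in the question.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: keep only the listed DataFrame columns."""

    def __init__(self, cols):
        # store the parameter unchanged so get_params/set_params behave as usual
        self.cols = cols

    def fit(self, X, y=None):
        # nothing to learn; y is optional so the step works in unsupervised pipelines
        return self

    def transform(self, X):
        # accept a single column name as well as a list of names
        cols = [self.cols] if isinstance(self.cols, str) else list(self.cols)
        return X[cols].copy()


X = pd.DataFrame([[0., 1., 2., 2.],
                  [1., 1., 0., 1.]], columns=["a", "b", "c", "d"])

pipeline = Pipeline([
    ("select", ColumnSelector(cols=["a", "d"])),  # keep only "a" and "d"
    ("scale", StandardScaler()),                  # scale whatever was kept
])

result = pipeline.fit_transform(X)
print(result)  # rows [-1, 1] and [1, -1]
```

Because the column list is an ordinary estimator parameter, it can be changed later with `pipeline.set_params(select__cols=[...])`, and the fitted pipeline applies the same selection to test or scoring data through `pipeline.transform`.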