I have a numpy array X that has 3 columns and looks like the following:
array([[   3791,    2629,       0],
       [1198760,  113989,       0],
       [4120665,       0,       1],
       ...
The first 2 columns are continuous values and the last column is binary (0,1). I would like to apply the StandardScaler class only to the first 2 columns. I am currently doing this the following way:
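import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_subset = scaler.fit_transform(X[:, [0, 1]])  # scale only the two continuous columns
X_last_column = X[:, 2]                        # binary column, left untouched
X_std = np.concatenate((X_subset, X_last_column[:, np.newaxis]), axis=1)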
The output of X_std is then:
array([[-0.34141308, -0.18316715,  0.        ],
       [-0.22171671, -0.17606473,  0.        ],
       [ 0.07096154, -0.18333483,  1.        ],
       ...,
Is there a way to perform this all in one step? I would like to include this as part of a pipeline where it will scale the first 2 columns and leave the last binary column as is.
StandardScaler standardizes each feature by subtracting its mean and then dividing by its standard deviation, so that the scaled feature has zero mean and unit variance.
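To make that concrete, here is a small sketch comparing the manual formula with StandardScaler. The array X_demo and its values are made up for illustration and are not from the question.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical sample values, chosen only for illustration.
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

# Manual standardization: subtract the column mean, then divide by the
# column standard deviation. StandardScaler uses the population standard
# deviation (ddof=0), which is also NumPy's default for std().
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# The same computation done by scikit-learn.
scaled = StandardScaler().fit_transform(X_demo)

print(np.allclose(manual, scaled))
```

Both results should agree up to floating-point rounding, and each scaled column should have mean 0 and standard deviation 1.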
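Since scikit-learn version 0.20 you can use the class sklearn.compose.ColumnTransformer for exactly this purpose: list the columns to scale and pass remainder='passthrough' so that the remaining columns are left as is.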
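A minimal sketch of the ColumnTransformer approach might look like the following. It assumes scikit-learn 0.20 or later and uses only the three rows shown in the question, so the scaled values will differ from the output above, which was computed on the full array. The step name "scale" is arbitrary.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Only the three rows visible in the question; the real array has more rows.
X = np.array([[3791, 2629, 0],
              [1198760, 113989, 0],
              [4120665, 0, 1]])

ct = ColumnTransformer(
    # Apply StandardScaler to columns 0 and 1 only.
    transformers=[("scale", StandardScaler(), [0, 1])],
    # Pass every other column (here, the binary column 2) through unchanged.
    remainder="passthrough",
)

X_std = ct.fit_transform(X)
print(X_std)
```

The transformed columns come first in the output, followed by the passed-through ones, so here the original column order is preserved. Since ct is itself a transformer, it can be used as a step in a Pipeline.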
I ended up using a class to select columns like this:
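from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a fixed set of columns from a 2-D array."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, x, y=None):
        # Nothing to learn; the column indices are given up front.
        return self

    def transform(self, data_array):
        return data_array[:, self.columns]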
I then used FeatureUnion in my pipeline as follows to fit StandardScaler only to continuous variables:
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

FeatureUnion(
    transformer_list=[
        ('continuous', Pipeline([  # Scale the first 2 numeric columns
            ('selector', ItemSelector(columns=[0, 1])),
            ('scaler', StandardScaler())
        ])),
        ('categorical', Pipeline([  # Leave the last binary column as is
            ('selector', ItemSelector(columns=[2]))
        ]))
    ]
)
This worked well for me.
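As a quick check, the sketch below runs the FeatureUnion approach end to end and compares it with the manual slice-and-concatenate method from the question. ItemSelector is re-declared so the snippet runs on its own, and the input is just the three rows shown in the question, not the full array.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class ItemSelector(BaseEstimator, TransformerMixin):
    """Same column selector as in the answer above."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, x, y=None):
        return self

    def transform(self, data_array):
        return data_array[:, self.columns]


# Only the three rows visible in the question.
X = np.array([[3791, 2629, 0],
              [1198760, 113989, 0],
              [4120665, 0, 1]])

union = FeatureUnion(transformer_list=[
    ("continuous", Pipeline([
        ("selector", ItemSelector(columns=[0, 1])),
        ("scaler", StandardScaler()),
    ])),
    ("categorical", Pipeline([
        ("selector", ItemSelector(columns=[2])),
    ])),
])
X_union = union.fit_transform(X)

# Manual approach from the question, for comparison.
X_manual = np.column_stack([StandardScaler().fit_transform(X[:, [0, 1]]), X[:, 2]])

print(np.allclose(X_union, X_manual))
```

FeatureUnion stacks the outputs side by side in the order the transformers are listed, which is why the scaled columns come first and the binary column last.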