Using sklearn StandardScaler on only select columns

I have a numpy array X that has 3 columns and looks like the following:

array([[    3791,     2629,        0],
       [ 1198760,   113989,        0],
       [ 4120665,        0,        1],
       ...

The first 2 columns are continuous values and the last column is binary (0 or 1). I would like to apply StandardScaler only to the first 2 columns. I am currently doing it as follows:

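import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_subset = scaler.fit_transform(X[:, [0, 1]])  # scale only the two continuous columns
X_last_column = X[:, 2]  # binary column, left unchanged
X_std = np.concatenate((X_subset, X_last_column[:, np.newaxis]), axis=1)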

The output of X_std is then:

array([[-0.34141308, -0.18316715,  0.        ],
       [-0.22171671, -0.17606473,  0.        ],
       [ 0.07096154, -0.18333483,  1.        ],
       ...,

Is there a way to perform this all in one step? I would like to include this as part of a pipeline where it will scale the first 2 columns and leave the last binary column as is.

What does sklearn's StandardScaler do?

StandardScaler standardizes each feature by subtracting its mean and then scaling it to unit variance, that is, dividing the centered values by the feature's standard deviation.
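To make that definition concrete, here is a small sketch comparing StandardScaler with the manual computation z = (x - mean) / std. The sample values and the names X_demo, scaled and manual are made up for illustration. StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default for std.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up sample data with two continuous columns (illustration only)
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

# Standardize with scikit-learn
scaled = StandardScaler().fit_transform(X_demo)

# Standardize by hand: subtract each column's mean, divide by its std (ddof=0)
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Both approaches should agree up to floating-point error
print(np.allclose(scaled, manual))
```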

Since scikit-learn version 0.20 you can use the class sklearn.compose.ColumnTransformer for exactly this purpose. Setting remainder='passthrough' leaves the columns you do not list untouched.
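A minimal sketch of how this could look for the array in the question, assuming scikit-learn 0.20 or later. Only the three rows shown in the question are used, and the step name 'scaler' is an arbitrary label. With remainder='passthrough' the unlisted column is appended after the transformed ones, so the original column order is preserved here.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Only the three rows shown in the question
X = np.array([[3791, 2629, 0],
              [1198760, 113989, 0],
              [4120665, 0, 1]])

# Scale columns 0 and 1; pass column 2 through unchanged
ct = ColumnTransformer(
    transformers=[('scaler', StandardScaler(), [0, 1])],
    remainder='passthrough',
)

X_std = ct.fit_transform(X)
print(X_std)
```

Because ColumnTransformer is itself a transformer, it can be used as a step in a Pipeline like any other preprocessing step.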

I ended up using a class to select columns like this:

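from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a subset of columns from a 2-D NumPy array."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, x, y=None):
        # Nothing to learn; the selector is stateless
        return self

    def transform(self, data_array):
        return data_array[:, self.columns]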

I then used FeatureUnion in my pipeline as follows to fit StandardScaler only to continuous variables:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

FeatureUnion(
    transformer_list=[
        ('continuous', Pipeline([  # Scale the first 2 numeric columns
            ('selector', ItemSelector(columns=[0, 1])),
            ('scaler', StandardScaler())
        ])),
        ('categorical', Pipeline([  # Leave the last binary column as is
            ('selector', ItemSelector(columns=[2]))
        ]))
    ]
)

This worked well for me.
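For anyone who wants to check this approach, the following sketch puts the pieces together and compares the FeatureUnion output with the manual concatenation from the question. It uses only the three rows shown in the question, and the names union, X_union and X_manual are illustrative.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a subset of columns from a 2-D NumPy array."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, x, y=None):
        return self

    def transform(self, data_array):
        return data_array[:, self.columns]


# Only the three rows shown in the question
X = np.array([[3791, 2629, 0],
              [1198760, 113989, 0],
              [4120665, 0, 1]])

union = FeatureUnion(
    transformer_list=[
        ('continuous', Pipeline([
            ('selector', ItemSelector(columns=[0, 1])),
            ('scaler', StandardScaler()),
        ])),
        ('categorical', Pipeline([
            ('selector', ItemSelector(columns=[2])),
        ])),
    ]
)
X_union = union.fit_transform(X)

# Manual version from the question, for comparison
X_manual = np.concatenate(
    (StandardScaler().fit_transform(X[:, [0, 1]]), X[:, [2]]), axis=1
)

print(np.allclose(X_union, X_manual))
```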

Tags:

python

dataset

scikit-learn

billypilgrim
