I'm building a linear regression model in scikit-learn and am scaling the inputs as a preprocessing step in a scikit-learn Pipeline. Is there any way I can avoid scaling binary columns? What's happening is that these columns are being scaled along with every other column, so their values are centered around 0 rather than staying 0 or 1. I end up with values like [-0.6, 0.3], which means input values of 0 now influence predictions in my linear model.
Basic code to illustrate:
>>> import numpy as np
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.linear_model import Ridge
>>> X = np.hstack((np.random.random((1000, 2)),
...                np.random.randint(2, size=(1000, 2))))
>>> X
array([[ 0.30314072, 0.22981496, 1. , 1. ],
[ 0.08373292, 0.66170678, 1. , 0. ],
[ 0.76279599, 0.36658793, 1. , 0. ],
...,
[ 0.81517519, 0.40227095, 0. , 0. ],
[ 0.21244587, 0.34141014, 0. , 0. ],
[ 0.2328417 , 0.14119217, 0. , 0. ]])
>>> scaler = StandardScaler()
>>> scaler.fit_transform(X)
array([[-0.67768374, -0.95108883, 1.00803226, 1.03667198],
[-1.43378124, 0.53576375, 1.00803226, -0.96462528],
[ 0.90632643, -0.48022732, 1.00803226, -0.96462528],
...,
[ 1.08682952, -0.35738315, -0.99203175, -0.96462528],
[-0.99022572, -0.56690563, -0.99203175, -0.96462528],
[-0.91994001, -1.25618613, -0.99203175, -0.96462528]])
I'd love for the output of the last line to be:
>>> scaler.fit_transform(X, dont_scale_binary_or_something=True)
array([[-0.67768374, -0.95108883, 1. , 1. ],
[-1.43378124, 0.53576375, 1. , 0. ],
[ 0.90632643, -0.48022732, 1. , 0. ],
...,
[ 1.08682952, -0.35738315, 0. , 0. ],
[-0.99022572, -0.56690563, 0. , 0. ],
[-0.91994001, -1.25618613, 0. , 0. ]])
Any way I can accomplish this? I suppose I could just select the columns that aren't binary, transform only those, and then put the transformed values back into the array, but I'd like it to play nicely with the scikit-learn Pipeline workflow, so I can just do something like:
clf = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge())])
clf.set_params(scaler__dont_scale_binary_features=True, ridge__alpha=0.04).fit(X, y)
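For reference, the manual workaround I describe above would look something like this (just a sketch, assuming the binary columns are the last two):

scaled = X.copy()
scaled[:, :-2] = StandardScaler().fit_transform(X[:, :-2])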
You should create a custom scaler which ignores the last two columns while scaling.
from sklearn.base import TransformerMixin
from sklearn.preprocessing import StandardScaler
import numpy as np

class CustomScaler(TransformerMixin):
    def __init__(self):
        self.scaler = StandardScaler()

    def fit(self, X, y=None):
        # fit the scaler on the continuous (non-binary) columns only
        self.scaler.fit(X[:, :-2], y)
        return self

    def transform(self, X):
        # scale the continuous columns, then reattach the untouched binary columns
        X_head = self.scaler.transform(X[:, :-2])
        return np.concatenate((X_head, X[:, -2:]), axis=1)
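You can then use it in place of StandardScaler in your Pipeline. A minimal sketch, assuming a target vector y is defined:

clf = Pipeline([('scaler', CustomScaler()), ('ridge', Ridge())])
clf.fit(X, y)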
I'm posting code that I adapted from @miindlek's response in case it is helpful to others. I encountered an error when I didn't also inherit from BaseEstimator, so it is included below. Thank you again @miindlek. Here, bin_vars_index is an array of column indices for the binary variables and cont_vars_index is the same for the continuous variables that you want to scale.
from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class CustomScaler(BaseEstimator, TransformerMixin):
    # note: returns the feature matrix with the binary columns ordered first
    def __init__(self, bin_vars_index, cont_vars_index, copy=True, with_mean=True, with_std=True):
        self.bin_vars_index = bin_vars_index
        self.cont_vars_index = cont_vars_index
        # store constructor params under their own names so that
        # BaseEstimator's get_params()/set_params() work in a Pipeline
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std

    def fit(self, X, y=None):
        # fit a StandardScaler on the continuous columns only
        self.scaler_ = StandardScaler(copy=self.copy, with_mean=self.with_mean,
                                      with_std=self.with_std)
        self.scaler_.fit(X[:, self.cont_vars_index], y)
        return self

    def transform(self, X):
        # StandardScaler.transform does not accept y; scale the continuous
        # columns, then prepend the untouched binary columns
        X_tail = self.scaler_.transform(X[:, self.cont_vars_index])
        return np.concatenate((X[:, self.bin_vars_index], X_tail), axis=1)
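With the sample X from the question (continuous columns 0 and 1, binary columns 2 and 3), usage would look roughly like this (a sketch; y is assumed to exist):

clf = Pipeline([('scaler', CustomScaler(bin_vars_index=[2, 3], cont_vars_index=[0, 1])),
                ('ridge', Ridge())])
clf.set_params(ridge__alpha=0.04).fit(X, y)

Keep in mind that the transformed feature matrix puts the binary columns first, as noted in the comment above.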