The machine learning literature strongly recommends normalizing the data for SVMs (see Preprocessing data in scikit-learn). And, as answered before, the same StandardScaler should be applied to both the training and test data.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

estimators = [('normalize', StandardScaler()), ('svm', SVC(class_weight='balanced'))]
clf = Pipeline(estimators)
# Training
clf.fit(X_train, y)
# Classification
clf.predict(X_test)
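For context, here is a minimal self-contained sketch of that pipeline end to end; the synthetic dataset and the train/test split are assumptions purely for illustration:

# Minimal runnable sketch of the pipeline above on synthetic data
# (make_classification and the split are illustrative assumptions).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y_all = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y, y_test = train_test_split(X, y_all, random_state=0)

clf = Pipeline([('normalize', StandardScaler()), ('svm', SVC(class_weight='balanced'))])
clf.fit(X_train, y)               # scaler and SVM are fit on the training data only
print(clf.score(X_test, y_test))  # test data is only transformed, never fit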
The feature scaling performed by StandardScaler is carried out without reference to the target classes. It considers only the X feature matrix, calculating the mean and standard deviation of each feature across all samples, irrespective of each sample's target class.
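To make that concrete, here is a small sketch (the toy array is an assumption) showing that fit computes per-feature statistics and ignores y entirely:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # illustrative toy data
scaler = StandardScaler().fit(X_toy)   # y is not needed (and is ignored if passed)
print(scaler.mean_)    # per-feature means: [ 2. 20.]
print(scaler.scale_)   # per-feature stds:  [0.8165 8.1650]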
Each component of the pipeline operates independently: only the data is passed between them. Let's expand the pipeline's clf.fit(X_train, y). It roughly does the following:
X_train_scaled = clf.named_steps['normalize'].fit_transform(X_train, y)
clf.named_steps['svm'].fit(X_train_scaled, y)
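If you want to verify this equivalence yourself, here is a hedged sketch (assuming the imports, data, and fitted clf from the earlier synthetic-data example) comparing the manual expansion with Pipeline.fit:

# Sketch checking the manual expansion against Pipeline.fit
manual_scaler = StandardScaler()
X_train_scaled = manual_scaler.fit_transform(X_train, y)  # y is accepted but ignored
manual_svm = SVC(class_weight='balanced').fit(X_train_scaled, y)
# Both scalers were fit on the same X_train, so their statistics agree:
assert np.allclose(manual_scaler.mean_, clf.named_steps['normalize'].mean_)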
The first scaling step actually ignores the y it is passed, but calculates the mean and standard deviation of each feature in X_train and stores them in its mean_ and scale_ attributes (the fit component; older scikit-learn versions called the second attribute std_). It also standardizes X_train and returns the result (the transform component). The next step learns an SVM model on the scaled data and handles multiclass problems internally (SVC uses a one-vs-one scheme).
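As a sanity check, the transform step is just elementwise standardization with the stored statistics; a brief sketch of that identity, reusing the fitted pipeline from above:

# transform applies elementwise standardization with the stored training statistics
scaler = clf.named_steps['normalize']
assert np.allclose(scaler.transform(X_train),
                   (X_train - scaler.mean_) / scaler.scale_)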
Now the pipeline's perspective for classification. clf.predict(X_test) expands to:
X_test_scaled = clf.named_steps['normalize'].transform(X_test)
y_pred = clf.named_steps['svm'].predict(X_test_scaled)
returning y_pred. In the first line it uses the stored mean_ and scale_ to apply the transformation to X_test, i.e. it uses parameters learnt from the training data.
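This is exactly why you should not fit a fresh scaler on the test set. The following sketch (again reusing the earlier synthetic data and fitted pipeline) shows that the two transformations generally disagree:

# Fitting a new scaler on X_test uses test-set statistics and generally
# disagrees with the training-time transformation -- avoid it.
scaler = clf.named_steps['normalize']
X_test_good = scaler.transform(X_test)               # training-set statistics
X_test_bad = StandardScaler().fit_transform(X_test)  # test-set statistics
print(np.allclose(X_test_good, X_test_bad))          # typically False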
Yes, the scaling algorithm isn't very complicated: it merely subtracts the mean and divides by the standard deviation. But StandardScaler:

- stores the statistics computed in fit or fit_transform for later transform operations (as above)
- provides an inverse_transform method to undo the scaling
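For instance, a quick sketch of the round trip, reusing the fitted scaler from above:

# inverse_transform undoes the standardization, recovering the original values
scaler = clf.named_steps['normalize']
X_roundtrip = scaler.inverse_transform(scaler.transform(X_train))
print(np.allclose(X_roundtrip, X_train))   # True, up to floating-point error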