 

scikit-learn preprocessing SVM with multiple classes in a pipeline

The machine learning literature strongly recommends normalizing data for SVMs (Preprocessing data in scikit-learn). And as answered before, the same StandardScaler should be applied to both training and test data.

  1. What are the advantages of using StandardScaler over manually subtracting the mean and dividing by the standard deviation (other than the ability to use it in a pipeline)?
  2. LinearSVC in scikit-learn depends on one-vs-the-rest for multiple classes (as larsmans mentioned, SVC depends on one-vs-one for multi-class). So what would happen if I have multiple classes trained with a pipeline with normalization as the first estimator? Would it also calculate the mean and standard deviation of each class, and use them during classification?
  3. To be more specific, does the following classifier apply different means and standard deviations to each class before the svm stage of the pipeline?
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

estimators = [('normalize', StandardScaler()), ('svm', SVC(class_weight='auto'))]
clf = Pipeline(estimators)
# Training
clf.fit(X_train, y)
# Classification
clf.predict(X_test)

asked Apr 22 '13 by dashesy


1 Answer

The feature scaling performed by StandardScaler is performed without reference to the target classes. It only considers the X feature matrix. It calculates the mean and standard deviation of each feature across all samples, irrespective of the target class of each sample.
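
To see this concretely, here is a minimal sketch on small made-up data (X and y are synthetic, for illustration only). StandardScaler's fit accepts y but ignores it, and the learned statistics are per-feature, over all samples:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features, four samples from two classes; y is illustrative only.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y = np.array([0, 0, 1, 1])

scaler = StandardScaler().fit(X, y)  # y is accepted but ignored

# The learned parameters are per-feature statistics over *all* samples,
# not per-class statistics.
np.testing.assert_allclose(scaler.mean_, X.mean(axis=0))   # [2.5, 25.0]
np.testing.assert_allclose(scaler.scale_, X.std(axis=0))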

Each component of the pipeline operates independently: only the data is passed between them. Let's expand the pipeline's clf.fit(X_train, y). It roughly does the following:

X_train_scaled = clf.named_steps['normalize'].fit_transform(X_train, y)
clf.named_steps['svm'].fit(X_train_scaled, y)

The first scaling step actually ignores the y it is passed, but calculates the mean and standard deviation of each feature in X_train and stores them in its mean_ and std_ attributes (the fit component; in recent scikit-learn versions std_ is called scale_). It also centers and scales X_train and returns the result (the transform component). The next step learns an SVM model over the scaled features and does whatever is necessary for multi-class classification (one-vs-one for SVC, one-vs-rest for LinearSVC).
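
A quick way to convince yourself that the pipeline is equivalent to this manual two-step form is to run both side by side; the sketch below uses synthetic data and default SVC settings:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_train = rng.randn(60, 4)
y = rng.randint(0, 3, size=60)  # three classes

# Pipeline form
clf = Pipeline([('normalize', StandardScaler()), ('svm', SVC())])
clf.fit(X_train, y)

# Manual two-step form
scaler = StandardScaler().fit(X_train)
svm = SVC().fit(scaler.transform(X_train), y)

# Same fitted scaler statistics, same predictions
np.testing.assert_allclose(clf.named_steps['normalize'].mean_, scaler.mean_)
assert (clf.predict(X_train) == svm.predict(scaler.transform(X_train))).all()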

Now consider classification from the pipeline's perspective. clf.predict(X_test) expands to:

X_test_scaled = clf.named_steps['normalize'].transform(X_test)
y_pred = clf.named_steps['svm'].predict(X_test_scaled)

returning y_pred. In the first line, the scaler transforms X_test using the stored mean_ and std_, i.e. the parameters learnt from the training data; it never recomputes statistics on the test set.
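
In other words, the transform applied at predict time is a fixed map built from the training statistics. A minimal sketch (again on made-up data):

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_train = rng.randn(50, 3)
X_test = rng.randn(10, 3)

scaler = StandardScaler().fit(X_train)

# transform() on new data reuses the *training* mean_ and scale_;
# it does not recompute statistics from X_test.
manual = (X_test - scaler.mean_) / scaler.scale_
np.testing.assert_allclose(scaler.transform(X_test), manual)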

Yes, the scaling algorithm isn't very complicated. It just subtracts the mean and divides by the std. But StandardScaler:

  • provides a name to the algorithm so you can pull it out of the library
  • saves you from rolling your own, ensuring it works correctly without requiring you to understand its internals
  • remembers the parameters from a fit or fit_transform for later transform operations (as above)
  • provides the same interface as other data transformations (and hence can be used in a pipeline)
  • operates over dense or sparse matrices (for sparse input, centering is disabled via with_mean=False)
  • is able to reverse the transformation with its inverse_transform method
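
For instance, a minimal round-trip sketch of the last point, on made-up data:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(20, 2) * 5 + 3

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# inverse_transform undoes the scaling, recovering the original data.
np.testing.assert_allclose(scaler.inverse_transform(X_scaled), X)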

answered Sep 28 '22 by joeln