I'm using the current stable version 0.13 of scikit-learn. I'm applying a linear support vector classifier to some data using the class sklearn.svm.LinearSVC.
In the chapter about preprocessing in scikit-learn's documentation, I've read the following:
Many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
Question 1: Is standardization useful for SVMs in general, also for those with a linear kernel function as in my case?
Question 2: As far as I understand, I have to compute the mean and standard deviation on the training data and apply this same transformation on the test data using the class sklearn.preprocessing.StandardScaler. However, what I don't understand is whether I have to transform the training data as well, or just the test data, prior to feeding it to the SVM classifier.
That is, do I have to do this:
    scaler = StandardScaler()
    scaler.fit(X_train)  # only compute mean and std here
    X_test = scaler.transform(X_test)  # perform standardization by centering and scaling
    clf = LinearSVC()
    clf.fit(X_train, y_train)
    clf.predict(X_test)
Or do I have to do this:
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)  # compute mean and std, and transform the training data as well
    X_test = scaler.transform(X_test)  # same as above
    clf = LinearSVC()
    clf.fit(X_train, y_train)
    clf.predict(X_test)
In short, do I have to use scaler.fit(X_train) or scaler.fit_transform(X_train) on the training data in order to get reasonable results with LinearSVC?
Because Support Vector Machine (SVM) optimization works by minimizing the norm of the weight vector w, the optimal hyperplane is influenced by the scale of the input features. It is therefore recommended to standardize the data (mean 0, variance 1) prior to training an SVM model.
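To make that concrete, here is a minimal sketch with made-up toy data (the random data, seed, and scale factor below are assumptions for illustration, not from the question). It shows that the weight vector learned by LinearSVC depends on the units of the input, while standardizing first removes that dependence:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVC

    rng = np.random.RandomState(0)
    X = rng.randn(200, 2)                     # toy data, both features on unit scale
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    X_big = X.copy()
    X_big[:, 1] *= 1000.0                     # same information, feature 1 now on a huge scale

    print(LinearSVC().fit(X, y).coef_)        # both features get weights of a similar magnitude
    print(LinearSVC().fit(X_big, y).coef_)    # feature 1's coefficient is now on a very different
                                              # scale, so the L2 penalty treats the features unevenly
    print(LinearSVC().fit(StandardScaler().fit_transform(X_big), y).coef_)
                                              # standardizing restores comparable weights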
Before training a Support Vector Machine (SVM): if one feature has very large values, it will dominate over the other features when calculating distances. Standardization gives all features the same influence on the distance metric.
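A quick illustration of that dominance effect, using made-up numbers (the array below is an assumption for illustration):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # feature 0 lives in [0, 1], feature 1 lives in the thousands
    X = np.array([[0.1, 1000.0],
                  [0.9, 1200.0],
                  [0.5, 5000.0]])

    # raw Euclidean distances are driven almost entirely by feature 1
    print(np.linalg.norm(X[0] - X[1]))  # ~200, even though feature 0 differs a lot
    print(np.linalg.norm(X[0] - X[2]))  # ~4000

    # after standardization, both features contribute on the same order
    Xs = StandardScaler().fit_transform(X)
    print(np.linalg.norm(Xs[0] - Xs[1]))
    print(np.linalg.norm(Xs[0] - Xs[2]))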
Neither.
scaler.transform(X_train) doesn't have any effect. The transform operation is not in-place. You have to do

    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
or
    X_train = scaler.fit(X_train).transform(X_train)
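For StandardScaler these two spellings give identical results; a small sanity check with made-up data (the array below is an assumption):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

    a = StandardScaler().fit_transform(X_train)
    b = StandardScaler().fit(X_train).transform(X_train)
    assert np.allclose(a, b)  # fit_transform equals fit followed by transform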
You always need to apply the same preprocessing to both training and test data. And yes, standardization is always good if it reflects your beliefs about the data. In particular, for kernel SVMs it is often crucial.
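One way to make "same preprocessing on both" hard to get wrong is to chain the scaler and the classifier; here is a minimal sketch using sklearn.pipeline.Pipeline (assuming X_train, y_train, and X_test are already defined):

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVC

    # the pipeline fits the scaler on the training data only,
    # then reuses the learned mean/std when predicting on the test data
    clf = Pipeline([("scaler", StandardScaler()),
                    ("svm", LinearSVC())])
    clf.fit(X_train, y_train)
    clf.predict(X_test)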