 

Scaling data in scikit-learn SVM

While libsvm provides tools for scaling data, with Scikit-Learn (which, as far as I understand, is based on libSVM for the SVC classifier) I can find no way to scale my data.

Basically I want to use 4 features, of which 3 range from 0 to 1 and the last one is a "big" highly variable number.

If I include the fourth feature in libSVM (using the easy.py script which scales my data automatically) I get some very nice results (96% accuracy). If I include the fourth variable in Scikit-Learn the accuracy drops to ~78% - but if I exclude it, I get the same results I get in libSVM when excluding that feature. Therefore I am pretty sure it's a problem of missing scaling.

How do I replicate programmatically (i.e. without calling svm-scale) the scaling that libSVM performs?

asked Nov 10 '12 by luke14free


People also ask

How do you scale data in SVM in Python?

You should use a Scaler for this, not the freestanding scale function, because a Scaler can be plugged into a Pipeline, e.g. scaling_svm = Pipeline([("scaler", Scaler()), ("svm", SVC(C=1000))]). Inside a Pipeline, the scaler is fitted on the training data only and the same transformation is then applied to the test data.
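A minimal sketch of that approach, assuming a recent scikit-learn release where the scaler class is named StandardScaler (older versions called it Scaler); the dataset and the C value are only illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Chain scaling and classification so that each cross-validation split
# is scaled using statistics computed from its own training portion only.
scaling_svm = Pipeline([("scaler", StandardScaler()),
                        ("svm", SVC(C=1000))])

print(cross_val_score(scaling_svm, X, y, cv=5).mean())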

Do I need to scale data for SVM?

Because Support Vector Machine (SVM) optimization works by minimizing the decision vector w, the optimal hyperplane is influenced by the scale of the input features; it is therefore recommended to standardize the data (mean 0, variance 1) before training an SVM model.
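To make this concrete, here is a small sketch with made-up values mirroring the question: three features in the [0, 1] range plus one large, highly variable feature, all standardized column-wise to mean 0 and variance 1:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data only: three small features and one "big" feature.
X = np.array([[0.2, 0.5, 0.9,  15000.0],
              [0.1, 0.7, 0.3, 230000.0],
              [0.9, 0.4, 0.6,   4200.0]])

X_std = StandardScaler().fit_transform(X)

print(X_std.mean(axis=0))  # approximately 0 for every column
print(X_std.std(axis=0))   # approximately 1 for every column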

Is SVM affected by scaling?

Yes: feature scaling affects the SVM classifier's outcome, and standardizing the feature values typically improves classifier performance significantly.

What is feature scaling sklearn?

Feature scaling (or standardization) is a data pre-processing step applied to the independent variables (features). It normalizes the data to a particular range and can also speed up computations. Package used: sklearn.preprocessing.
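If the goal is scaling to a fixed range rather than standardizing, sklearn.preprocessing also provides MinMaxScaler; a minimal sketch (the sample data and the [0, 1] range are just for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 800.0],
              [4.0, 500.0]])

# Rescale each feature to the [0, 1] range (MinMaxScaler's default).
print(MinMaxScaler(feature_range=(0, 1)).fit_transform(X))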


1 Answer

You have that functionality in sklearn.preprocessing:

>>> from sklearn import preprocessing
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X_scaled = preprocessing.scale(X)

>>> X_scaled                                          
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])

The data will then have zero mean and unit variance.
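One caveat worth adding: for classification you usually want to learn the scaling parameters on the training set and reuse them on the test set, which the freestanding scale function does not do for you. A hedged sketch using StandardScaler (dataset and split are only illustrative), reproducing the save-and-reapply workflow of libSVM's svm-scale, except that it standardizes rather than scaling to a range:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Learn mean and variance on the training data only...
scaler = StandardScaler().fit(X_train)

# ...and apply the identical transformation to both splits.
clf = SVC().fit(scaler.transform(X_train), y_train)
print(clf.score(scaler.transform(X_test), y_test))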

answered Sep 29 '22 by Maehler