 

How to store scaling parameters for later use

I want to apply the scaling that sklearn.preprocessing.scale from scikit-learn offers, to center a dataset that I will use to train an SVM classifier.

How can I then store the standardization parameters so that I can also apply them to the data that I want to classify?

I know I can use StandardScaler, but can I somehow serialize it to a file so that I won't have to fit it to my data every time I want to run the classifier?

asked Mar 11 '16 by LetsPlayYahtzee


People also ask

What is StandardScaler()?

StandardScaler is the industry's go-to scaler. It standardizes a feature by subtracting the mean and then scaling to unit variance, where scaling to unit variance means dividing all the values by the standard deviation.

Is StandardScaler same as Z score?

Yes. Standard scores (also called z-scores) are calculated as z = (x − μ) / σ, where μ is the mean (average) and σ is the standard deviation. StandardScaler applies exactly this transformation, so it results in a distribution with a standard deviation equal to 1. The variance is also equal to 1, because variance = standard deviation squared.

Why is StandardScaler used?

StandardScaler removes the mean and scales each feature/variable to unit variance. This operation is performed feature-wise in an independent way. StandardScaler can be influenced by outliers (if they exist in the dataset) since it involves the estimation of the empirical mean and standard deviation of each feature.
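As a small illustration of the z-score relationship described above (the toy array is made up for the example):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.], [2.], [3.], [4.]])

scaler = StandardScaler().fit(X)
manual = (X - X.mean(axis=0)) / X.std(axis=0)   # z-scores computed by hand

print(np.allclose(scaler.transform(X), manual))  # True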


2 Answers

I think that the best way is to pickle it post fit, as this is the most generic option. Perhaps you'll later create a pipeline composed of both a feature extractor and scaler. By pickling a (possibly compound) stage, you're making things more generic. The sklearn documentation on model persistence discusses how to do this.
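For example, a minimal sketch of that approach with the standard pickle module (X_train, X_new and the file name are just placeholders here):

import pickle
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)               # X_train: your training data (placeholder)

# serialize the fitted scaler to disk
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

# later, in the classification script, load it back and reuse it
with open('scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

X_new_scaled = scaler.transform(X_new)   # X_new: the data to classify (placeholder)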

Having said that, you can query sklearn.preprocessing.StandardScaler for the fit parameters:

scale_ : ndarray, shape (n_features,)
    Per-feature relative scaling of the data. New in version 0.17: scale_ is recommended instead of the deprecated std_.

mean_ : array of floats with shape [n_features]
    The mean value for each feature in the training set.

The following short snippet illustrates this:

from sklearn import preprocessing
import numpy as np

s = preprocessing.StandardScaler()
s.fit(np.array([[1., 2, 3, 4]]).T)
print(s.mean_, s.scale_)
# [ 2.5] [ 1.11803399]
answered Oct 24 '22 by Ami Tavory


Scale with StandardScaler:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)
scaled_data = scaler.transform(data)

save mean_ and var_ for later use

means = scaler.mean_ 
vars = scaler.var_    

(You can print and copy-paste means and vars, or save them to disk with np.save.)
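For instance, a minimal sketch of the np.save route (the file names here are just placeholders):

import numpy as np

np.save('means.npy', scaler.mean_)   # persist the fitted parameters
np.save('vars.npy', scaler.var_)

# later: means = np.load('means.npy'); vars = np.load('vars.npy')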

Later use of saved parameters

def scale_data(array, means=means, stds=vars ** 0.5):
    return (array - means) / stds

scale_new_data = scale_data(new_data)
answered Oct 24 '22 by Ioannis Nasios