I want to apply the scaling that the sklearn.preprocessing.scale function of scikit-learn offers for centering a dataset that I will use to train an SVM classifier.
How can I then store the standardization parameters so that I can also apply them to the data that I want to classify?
I know I can use StandardScaler, but can I somehow serialize it to a file so that I won't have to fit it to my data every time I want to run the classifier?
StandardScaler is the go-to choice here. 🙂 StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance, where scaling to unit variance means dividing all the values by the standard deviation. The standard score (also called the z-score) of a sample x is calculated as

z = (x − μ) / σ

where μ is the mean (average) of the training samples and σ is their standard deviation. StandardScaler thus produces a distribution with a standard deviation equal to 1; the variance is also equal to 1, because variance = standard deviation squared.
This operation is performed feature-wise in an independent way: StandardScaler removes the mean and scales each feature/variable to unit variance. Because it involves estimating the empirical mean and standard deviation of each feature, StandardScaler can be influenced by outliers, if they exist in the dataset.
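To make the feature-wise behaviour concrete, here is a minimal sketch (the array X is just illustrative data) showing that transform reproduces the per-column z-score formula:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1., 10.],
              [2., 20.],
              [3., 30.]])   # two features, standardized independently

scaler = StandardScaler().fit(X)
manual = (X - X.mean(axis=0)) / X.std(axis=0)   # per-feature z-scores

assert np.allclose(scaler.transform(X), manual)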
I think that the best way is to pickle it after fitting, as this is the most generic option. Perhaps you'll later create a pipeline composed of both a feature extractor and a scaler. By pickling a (possibly compound) stage, you're making things more generic. The sklearn documentation on model persistence discusses how to do this.
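As a minimal sketch of that (the file name scaler.pkl and the small arrays are arbitrary placeholders; joblib.dump from the persistence docs works the same way):

import pickle
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., 2.], [3., 4.], [5., 6.]])   # illustrative training data

scaler = StandardScaler().fit(X_train)

# serialize the fitted scaler once
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

# in the classification script, restore it instead of refitting
with open('scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

X_new = np.array([[2., 3.]])                         # illustrative data to classify
print(scaler.transform(X_new))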
Having said that, you can query sklearn.preprocessing.StandardScaler
for the fit parameters:
scale_ : ndarray, shape (n_features,)
    Per feature relative scaling of the data. New in version 0.17: scale_ is recommended instead of deprecated std_.
mean_ : array of floats with shape [n_features]
    The mean value for each feature in the training set.
The following short snippet illustrates this:
>>> from sklearn import preprocessing
>>> import numpy as np
>>> s = preprocessing.StandardScaler()
>>> _ = s.fit(np.array([[1., 2, 3, 4]]).T)
>>> s.mean_, s.scale_
(array([ 2.5]), array([ 1.11803399]))
Scale with StandardScaler:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)
scaled_data = scaler.transform(data)
Save mean_ and var_ for later use:
means = scaler.mean_
variances = scaler.var_
(you can print and copy-paste means and variances, or save them to disk with np.save)
Later use of the saved parameters:
def scale_data(array, means=means, stds=variances ** 0.5):
    return (array - means) / stds

scaled_new_data = scale_data(new_data)
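As a sketch of the np.save route mentioned above (the .npy file names are arbitrary):

import numpy as np

np.save('means.npy', scaler.mean_)
np.save('variances.npy', scaler.var_)

# later, in another session:
means = np.load('means.npy')
stds = np.load('variances.npy') ** 0.5
scaled_new_data = (new_data - means) / stds   # new_data: the array to classify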