I'm using the MinMaxScaler from sklearn to normalize the features of a model.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

training_set = np.random.rand(4, 4) * 10
training_set
[[ 6.01144787, 0.59753007, 2.0014852 , 3.45433657],
[ 6.03041646, 5.15589559, 6.64992437, 2.63440202],
[ 2.27733136, 9.29927394, 0.03718093, 7.7679183 ],
[ 9.86934288, 7.59003904, 6.02363739, 2.78294206]]
scaler = MinMaxScaler()
scaler.fit(training_set)
scaler.transform(training_set)
[[ 0.49184811, 0. , 0.29704831, 0.15972182],
[ 0.4943466 , 0.52384506, 1. , 0. ],
[ 0. , 1. , 0. , 1. ],
[ 1. , 0.80357559, 0.9052909 , 0.02893534]]
Now I want to use the same scaler to normalize the test set:
[[ 8.31263467, 7.99782295, 0.02031658, 9.43249727],
[ 1.03761228, 9.53173021, 5.99539478, 4.81456067],
[ 0.19715961, 5.97702519, 0.53347403, 5.58747666],
[ 9.67505429, 2.76225253, 7.39944931, 8.46746594]]
But I don't want to call scaler.fit() on the training data every time I need to scale new data. Is there a way to save the fitted scaler and load it back later from a different script?
You can save and load the scaler with pickle, which serializes your machine learning objects and writes the serialized bytes to a file.
From the sklearn documentation: class sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True) transforms features by scaling each feature to a given range. This estimator scales and translates each feature individually such that it is in the given range on the training set, i.e. between zero and one.
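A minimal sketch (assuming scaler is the fitted MinMaxScaler from the question, and test_set holds the test array):

import pickle

# save the fitted scaler to disk
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

# later, possibly in a different script: load and reuse it
with open('scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

scaler.transform(test_set)  # no refitting required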
Update: sklearn.externals.joblib is deprecated. Install and use the pure joblib instead. Please see Engineero's answer below, which is otherwise identical to mine.
Even better than pickle (which creates much larger files than this method), you can use sklearn's built-in tool:
from sklearn.externals import joblib
scaler_filename = "scaler.save"
joblib.dump(scaler, scaler_filename)
# And now to load...
scaler = joblib.load(scaler_filename)
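Once loaded, the scaler applies the parameters learned from the training set directly to new data, so there is no refitting (test_set here stands for the test array from the question):

test_set_scaled = scaler.transform(test_set)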
So I'm actually not an expert with this, but from a bit of research and a few helpful links, I think pickle and sklearn.externals.joblib are going to be your friends here.
The pickle package lets you save models, or "dump" them to a file.
I think this link is also helpful. It talks about creating a persistence model. Something that you're going to want to try is:
# could use: import pickle... however let's do something else
from sklearn.externals import joblib
# joblib is more efficient than pickle for objects that carry
# large numpy arrays, which sklearn models often do.
# then just 'dump' your model to a file
joblib.dump(clf, 'my_dope_model.pkl')
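And to load the model back later, for example in a different script (same file name as the dump above):

clf = joblib.load('my_dope_model.pkl')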
Here is where you can learn more about the sklearn externals.
Let me know if that doesn't help or I'm not understanding something about your model.
Note: sklearn.externals.joblib is deprecated. Install and use the pure joblib instead.
Just a note that sklearn.externals.joblib has been deprecated and is superseded by plain old joblib, which can be installed with pip install joblib:
import joblib
joblib.dump(my_scaler, 'scaler.gz')
my_scaler = joblib.load('scaler.gz')
Note that the file extension can be anything, but if it is one of ['.z', '.gz', '.bz2', '.xz', '.lzma'] then the corresponding compression protocol will be used. See the docs for joblib.dump() and joblib.load().
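For instance, a small sketch of extension-based compression (the file name is arbitrary):

joblib.dump(my_scaler, 'scaler.pkl.xz')   # compressed with LZMA, inferred from the .xz suffix
my_scaler = joblib.load('scaler.pkl.xz')  # decompression is inferred the same way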
You can use pickle to save the scaler:
import pickle

scalerfile = 'scaler.sav'
with open(scalerfile, 'wb') as f:
    pickle.dump(scaler, f)
Load it back:
import pickle

scalerfile = 'scaler.sav'
with open(scalerfile, 'rb') as f:
    scaler = pickle.load(f)

test_scaled_set = scaler.transform(test_set)
Alternatively, wrap the scaler and the model in a single pipeline, so both are fitted, saved, and loaded together:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
import joblib  # sklearn.externals.joblib is deprecated; use the standalone joblib package

# the scaler and the model are fitted together and persist as one object
pipeline = make_pipeline(MinMaxScaler(), YOUR_ML_MODEL())
model = pipeline.fit(X_train, y_train)
joblib.dump(model, 'filename.mod')
model = joblib.load('filename.mod')
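Because the scaler lives inside the pipeline, the loaded model scales new data automatically before predicting (a minimal sketch, assuming X_test is your held-out feature matrix):

predictions = model.predict(X_test)  # X_test is min-max scaled internally first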