I wrote a sample program to train an SVM using sklearn. Here is the code:
from sklearn import svm
from sklearn import datasets
from sklearn.externals import joblib

clf = svm.SVC()
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf.fit(X, y)                  # train the SVC on the full iris dataset
print(clf.predict(X))
joblib.dump(clf, 'clf.pkl')    # persist the fitted model to disk
When I dump the model I get this many files:
['clf.pkl', 'clf.pkl_01.npy', 'clf.pkl_02.npy', 'clf.pkl_03.npy', 'clf.pkl_04.npy', 'clf.pkl_05.npy', 'clf.pkl_06.npy', 'clf.pkl_07.npy', 'clf.pkl_08.npy', 'clf.pkl_09.npy', 'clf.pkl_10.npy', 'clf.pkl_11.npy']
I am confused whether I did something wrong, or is this normal. What are the *.npy files, and why are there 11 of them?
When compression is enabled, joblib.dump() uses the zlib compression method by default, as it gives the best tradeoff between speed and disk space. The other supported compression methods are 'gzip', 'bz2', 'lzma' and 'xz'.
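For example, a minimal sketch (reconstructed, not verbatim from the post) of dumping into a gzip-compressed file with a compression level of 3:

import joblib
import numpy as np

obj = np.arange(10)
# The ('gzip', 3) tuple picks the compression method and level;
# everything ends up in the single file obj.pkl.gz.
joblib.dump(obj, 'obj.pkl.gz', compress=('gzip', 3))
obj_back = joblib.load('obj.pkl.gz')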
Joblib is a set of tools that provide lightweight pipelining in Python: in particular, transparent disk-caching of functions with lazy re-evaluation (the memoize pattern), and easy, simple parallel computing.
You can also connect joblib to the Dask backend to scale out to a remote cluster for faster processing, or use XGBoost-on-Dask and/or dask-ml for distributed training on datasets that don't fit into local memory (a sketch follows below).
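A rough sketch of that idea, assuming dask.distributed is installed; the Client and the 'dask' joblib backend do the work here, and this code is not part of the original answer:

import joblib
from dask.distributed import Client
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

client = Client()  # starts/connects to a local Dask cluster by default
X, y = datasets.load_iris(return_X_y=True)
search = GridSearchCV(svm.SVC(), {'C': [0.1, 1, 10]}, n_jobs=-1)
with joblib.parallel_backend('dask'):  # scikit-learn's internal joblib calls run on Dask
    search.fit(X, y)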
If you don't pickle large numpy arrays, then regular pickle can be significantly faster, especially on large collections of small Python objects (e.g. a large dict of str objects), because the pickle module of the standard library is implemented in C while joblib's is pure Python.
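A quick way to see this for yourself (my own illustration, assuming the standalone joblib package; timings will vary by machine, so none are claimed here):

import pickle
import timeit
import joblib

# Many small Python objects, no big arrays
data = {str(i): 'value_%d' % i for i in range(10**6)}

def with_pickle():
    with open('data.pkl', 'wb') as f:
        pickle.dump(data, f)

def with_joblib():
    joblib.dump(data, 'data_joblib.pkl')

print(timeit.timeit(with_pickle, number=3))
print(timeit.timeit(with_joblib, number=3))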
Joblib is part of the SciPy ecosystem and provides utilities for pipelining Python jobs, including utilities for efficiently saving and loading Python objects that make use of NumPy data structures.
Python specific serialization: it is possible to save a model in scikit-learn by using Python's built-in persistence model, namely pickle:
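For instance, a short sketch along the lines of the scikit-learn documentation:

import pickle
from sklearn import datasets, svm

iris = datasets.load_iris()
clf = svm.SVC().fit(iris.data, iris.target)

s = pickle.dumps(clf)     # serialize the fitted model to a bytes object
clf2 = pickle.loads(s)    # restore it
print(clf2.predict(iris.data[0:1]))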
Presumably those are numpy arrays for your data; joblib, when loading back the .pkl, will look for those np arrays and load the model data back – EdChum
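To illustrate that comment: loading the model back only needs the main .pkl path, and the companion .npy files in the same directory are picked up automatically. A sketch, using the same old sklearn.externals.joblib import path as the question:

from sklearn import datasets
from sklearn.externals import joblib

iris = datasets.load_iris()
# joblib.load reads clf.pkl and pulls in the clf.pkl_*.npy files alongside it
clf2 = joblib.load('clf.pkl')
print(clf2.predict(iris.data))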
To save everything into one file you should set the compress parameter to True or to any integer (1, for example); see the sketch below.
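Applied to the question's model, that would look something like this (a sketch; the compress value and filename are just examples):

from sklearn import datasets, svm
from sklearn.externals import joblib

iris = datasets.load_iris()
clf = svm.SVC().fit(iris.data, iris.target)
# With compression enabled, everything is written into the single file clf.pkl
joblib.dump(clf, 'clf.pkl', compress=True)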
But you should know that the separated representation of np arrays is necessary for the main features of joblib dump/load: joblib can save and load objects with np arrays faster than pickle because of this separated representation, and, in contrast to pickle, joblib can correctly save and load objects with memmapped numpy arrays. If you want one-file serialization of the whole object (and don't want to save memmapped np arrays), I think it would be better to use pickle; AFAIK, in this case the joblib dump/load functionality will work at the same speed as pickle.
import pickle
import numpy as np
from sklearn.externals import joblib
vector = np.arange(0, 10**7)
%timeit joblib.dump(vector, 'vector.pkl')
# 1 loops, best of 3: 818 ms per loop
# file size ~ 80 MB
%timeit vector_load = joblib.load('vector.pkl')
# 10 loops, best of 3: 47.6 ms per loop
# Compressed
%timeit joblib.dump(vector, 'vector.pkl', compress=1)
# 1 loops, best of 3: 1.58 s per loop
# file size ~ 15.1 MB
%timeit vector_load = joblib.load('vector.pkl')
# 1 loops, best of 3: 442 ms per loop
# Pickle
%%timeit
with open('vector.pkl', 'wb') as f:
    pickle.dump(vector, f)
# 1 loops, best of 3: 927 ms per loop
%%timeit
with open('vector.pkl', 'rb') as f:
    vector_load = pickle.load(f)
# 10 loops, best of 3: 94.1 ms per loop