
sklearn: dumping a model using joblib dumps multiple files. Which one is the correct model?

I wrote a sample program to train an SVM using sklearn. Here is the code:

from sklearn import svm
from sklearn import datasets
from sklearn.externals import joblib

clf = svm.SVC()
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf.fit(X, y)

print(clf.predict(X))
joblib.dump(clf, 'clf.pkl') 

When I dump the model, I get this list of files:

['clf.pkl', 'clf.pkl_01.npy', 'clf.pkl_02.npy', 'clf.pkl_03.npy', 'clf.pkl_04.npy', 'clf.pkl_05.npy', 'clf.pkl_06.npy', 'clf.pkl_07.npy', 'clf.pkl_08.npy', 'clf.pkl_09.npy', 'clf.pkl_10.npy', 'clf.pkl_11.npy']

I'm not sure whether I did something wrong, or if this is normal. What are the *.npy files, and why are there 11 of them?

asked Nov 03 '15 by kcc__

People also ask

What is joblib.dump()?

By default, joblib.dump() uses the zlib compression method, as it gives the best tradeoff between speed and disk space. The other supported compression methods are 'gzip', 'bz2', 'lzma' and 'xz'.
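
A hedged sketch of that idea (assumes a joblib version where compress also accepts a (method, level) tuple):

import numpy as np
import joblib

data = np.arange(10**6)

# Dump into a single gzip-compressed file at compress level 3.
joblib.dump(data, 'data.pkl.gz', compress=('gzip', 3))

# load() transparently decompresses.
restored = joblib.load('data.pkl.gz')
assert (restored == data).all()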

What are joblib files?

Joblib is a set of tools to provide lightweight pipelining in Python. In particular: transparent disk-caching of functions and lazy re-evaluation (memoize pattern) easy simple parallel computing.
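
A minimal sketch of those two features, assuming the standalone joblib package (the cache directory path is arbitrary):

from joblib import Memory, Parallel, delayed

memory = Memory('./joblib_cache', verbose=0)  # on-disk cache location

@memory.cache
def square(x):
    print('computing', x)  # runs only on a cache miss
    return x * x

square(3)  # computed and written to the cache
square(3)  # returned from the cache; no print this time

# Simple parallel computing: evaluate over inputs with two workers.
results = Parallel(n_jobs=2)(delayed(square)(i) for i in range(4))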

What is joblib in sklearn?

You can connect joblib to the Dask backend to scale out to a remote cluster for even faster processing times. You can use XGBoost-on-Dask and/or dask-ml for distributed machine learning training on datasets that don't fit into local memory.
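
A hedged sketch of that pattern (assumes dask.distributed is installed; Client() here starts a local cluster, but could point at a remote scheduler):

import joblib
from dask.distributed import Client
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

client = Client()  # local cluster; pass an address string for a remote one

X, y = datasets.load_iris(return_X_y=True)
search = GridSearchCV(svm.SVC(), {'C': [0.1, 1, 10]}, cv=3)

# Route GridSearchCV's internal joblib parallelism to the Dask workers.
with joblib.parallel_backend('dask'):
    search.fit(X, y)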

What is the difference between Pickle and joblib?

If you don't pickle large numpy arrays, then regular pickle can be significantly faster, especially on large collections of small Python objects (e.g. a large dict of str objects), because the pickle module of the standard library is implemented in C while joblib is pure Python.
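
A rough, machine-dependent sketch of that claim (file names are arbitrary):

import pickle
import timeit

import joblib

d = {str(i): str(i) for i in range(10**6)}  # many small Python objects

def dump_pickle():
    with open('d_pickle.pkl', 'wb') as f:
        pickle.dump(d, f, protocol=pickle.HIGHEST_PROTOCOL)

def dump_joblib():
    joblib.dump(d, 'd_joblib.pkl')

print('pickle:', timeit.timeit(dump_pickle, number=3))
print('joblib:', timeit.timeit(dump_joblib, number=3))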

What is joblib in SciPy?

Joblib is part of the SciPy ecosystem and provides utilities for pipelining Python jobs, including utilities for efficiently saving and loading Python objects that make use of NumPy data structures.
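
A minimal save/load round trip along those lines (assumes a recent scikit-learn, where you import joblib directly; sklearn.externals.joblib has since been removed):

import joblib
from sklearn import datasets, svm

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC().fit(X, y)

joblib.dump(clf, 'clf.pkl')          # old joblib versions also emit clf.pkl_*.npy parts
clf_loaded = joblib.load('clf.pkl')  # load() finds any companion parts itself
print(clf_loaded.predict(X[:5]))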

Is it possible to save a model in scikit-learn?

It is possible to save a model in scikit-learn by using Python's built-in persistence model, namely pickle.
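
The pattern those docs describe, sketched briefly (pickle keeps everything in a single byte string, with no side files):

import pickle
from sklearn import datasets, svm

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC().fit(X, y)

s = pickle.dumps(clf)       # one in-memory blob
clf2 = pickle.loads(s)
print(clf2.predict(X[:2]))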

What is the difference between joblib and PKL?

Presumably those are numpy arrays for your data; joblib, when loading back the .pkl, will look for those np arrays and load back the model data. – EdChum


1 Answer

To save everything into one file, you should set the compress argument to True or to any number (1, for example).

But you should know that the separated representation of np arrays is necessary for the main features of joblib's dump/load: thanks to it, joblib can save and load objects containing np arrays faster than pickle, and, in contrast to pickle, joblib can correctly save and load objects with memmapped numpy arrays. If you want single-file serialization of the whole object (and don't need to save memmapped np arrays), I think it would be better to use pickle; AFAIK, in that case joblib's dump/load functionality works at the same speed as pickle.
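
Applied to the question's code, that looks like this (a sketch reusing the question's imports):

from sklearn import svm, datasets
from sklearn.externals import joblib  # matches the question's sklearn version

iris = datasets.load_iris()
clf = svm.SVC().fit(iris.data, iris.target)

# With compress set, everything goes into one file:
joblib.dump(clf, 'clf.pkl', compress=True)
# -> ['clf.pkl'] only, no clf.pkl_NN.npy companions

The benchmark below shows the speed and file-size tradeoff of compression versus plain joblib dumps and pickle: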

import pickle
import numpy as np
from sklearn.externals import joblib

vector = np.arange(0, 10**7)

%timeit joblib.dump(vector, 'vector.pkl')
# 1 loops, best of 3: 818 ms per loop
# file size ~ 80 MB
%timeit vector_load = joblib.load('vector.pkl')
# 10 loops, best of 3: 47.6 ms per loop

# Compressed
%timeit joblib.dump(vector, 'vector.pkl', compress=1)
# 1 loops, best of 3: 1.58 s per loop
# file size ~ 15.1 MB
%timeit vector_load = joblib.load('vector.pkl')
# 1 loops, best of 3: 442 ms per loop

# Pickle
%%timeit
with open('vector.pkl', 'wb') as f:
    pickle.dump(vector, f)
# 1 loops, best of 3: 927 ms per loop
%%timeit                                    
with open('vector.pkl', 'rb') as f:
    vector_load = pickle.load(f)
# 10 loops, best of 3: 94.1 ms per loop
answered Sep 16 '22 by Ibraim Ganiev