
sklearn: dumping a model using joblib dumps multiple files. Which one is the correct model?

I wrote a sample program to train an SVM using sklearn. Here is the code:

from sklearn import svm
from sklearn import datasets
from sklearn.externals import joblib

clf = svm.SVC()
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf.fit(X, y)

print(clf.predict(X))
joblib.dump(clf, 'clf.pkl') 

When I dump the model, I get this list of files:

['clf.pkl', 'clf.pkl_01.npy', 'clf.pkl_02.npy', 'clf.pkl_03.npy', 'clf.pkl_04.npy', 'clf.pkl_05.npy', 'clf.pkl_06.npy', 'clf.pkl_07.npy', 'clf.pkl_08.npy', 'clf.pkl_09.npy', 'clf.pkl_10.npy', 'clf.pkl_11.npy']

I'm not sure whether I did something wrong, or if this is normal. What are the *.npy files, and why are there 11 of them?

asked Nov 03 '15 by kcc__

People also ask

What is joblib.dump()?

By default, joblib.dump() uses the zlib compression method, as it gives the best tradeoff between speed and disk space. The other supported compression methods are 'gzip', 'bz2', 'lzma' and 'xz'.
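
A hedged sketch of that idea (assumes a joblib version where compress also accepts a (method, level) tuple):

import numpy as np
import joblib

data = np.arange(10**6)

# Dump into a single gzip-compressed file at compress level 3.
joblib.dump(data, 'data.pkl.gz', compress=('gzip', 3))

# load() transparently decompresses.
restored = joblib.load('data.pkl.gz')
assert (restored == data).all()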

What are joblib files?

Joblib is a set of tools to provide lightweight pipelining in Python. In particular: transparent disk-caching of functions and lazy re-evaluation (memoize pattern) easy simple parallel computing.
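
A minimal sketch of those two features, assuming the standalone joblib package (the cache directory path is arbitrary):

from joblib import Memory, Parallel, delayed

memory = Memory('./joblib_cache', verbose=0)  # on-disk cache location

@memory.cache
def square(x):
    print('computing', x)  # runs only on a cache miss
    return x * x

square(3)  # computed and written to the cache
square(3)  # returned from the cache; no print this time

# Simple parallel computing: evaluate over inputs with two workers.
results = Parallel(n_jobs=2)(delayed(square)(i) for i in range(4))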

What is joblib in sklearn?

You can connect joblib to the Dask backend to scale out to a remote cluster for even faster processing times. You can use XGBoost-on-Dask and/or dask-ml for distributed machine learning training on datasets that don't fit into local memory.
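
A hedged sketch of that pattern (assumes dask.distributed is installed; Client() here starts a local cluster, but could point at a remote scheduler):

import joblib
from dask.distributed import Client
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

client = Client()  # local cluster; pass an address string for a remote one

X, y = datasets.load_iris(return_X_y=True)
search = GridSearchCV(svm.SVC(), {'C': [0.1, 1, 10]}, cv=3)

# Route GridSearchCV's internal joblib parallelism to the Dask workers.
with joblib.parallel_backend('dask'):
    search.fit(X, y)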

What is the difference between Pickle and joblib?

If you don't pickle large numpy arrays, then regular pickle can be significantly faster, especially on large collections of small Python objects (e.g. a large dict of str objects), because the pickle module of the standard library is implemented in C while joblib is pure Python.
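
A rough, machine-dependent sketch of that claim (file names are arbitrary):

import pickle
import timeit

import joblib

d = {str(i): str(i) for i in range(10**6)}  # many small Python objects

def dump_pickle():
    with open('d_pickle.pkl', 'wb') as f:
        pickle.dump(d, f, protocol=pickle.HIGHEST_PROTOCOL)

def dump_joblib():
    joblib.dump(d, 'd_joblib.pkl')

print('pickle:', timeit.timeit(dump_pickle, number=3))
print('joblib:', timeit.timeit(dump_joblib, number=3))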

What is joblib in SciPy?

Joblib is part of the SciPy ecosystem and provides utilities for pipelining Python jobs, including utilities for efficiently saving and loading Python objects that make use of NumPy data structures.
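
A minimal save/load round trip along those lines (assumes a recent scikit-learn, where you import joblib directly; sklearn.externals.joblib has since been removed):

import joblib
from sklearn import datasets, svm

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC().fit(X, y)

joblib.dump(clf, 'clf.pkl')          # old joblib versions also emit clf.pkl_*.npy parts
clf_loaded = joblib.load('clf.pkl')  # load() finds any companion parts itself
print(clf_loaded.predict(X[:5]))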

Is it possible to save a model in scikit-learn?

It is possible to save a model in scikit-learn by using Python's built-in persistence model, namely pickle.
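
The pattern those docs describe, sketched briefly (pickle keeps everything in a single byte string, with no side files):

import pickle
from sklearn import datasets, svm

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC().fit(X, y)

s = pickle.dumps(clf)       # one in-memory blob
clf2 = pickle.loads(s)
print(clf2.predict(X[:2]))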

What is the difference between joblib and PKL?

Presumably those are numpy arrays for your data; joblib, when loading back the .pkl, will look for those np arrays and load back the model data. – EdChum


1 Answer

To save everything into one file, you should set the compress argument to True or to any number (1, for example).

But you should know that the separated representation of np arrays is necessary for the main features of joblib's dump/load: thanks to it, joblib can save and load objects containing np arrays faster than pickle, and, in contrast to pickle, joblib can correctly save and load objects with memmapped numpy arrays. If you want single-file serialization of the whole object (and don't need to save memmapped np arrays), I think it would be better to use pickle; AFAIK, in that case joblib's dump/load functionality works at the same speed as pickle.
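
Applied to the question's code, that looks like this (a sketch reusing the question's imports):

from sklearn import svm, datasets
from sklearn.externals import joblib  # matches the question's sklearn version

iris = datasets.load_iris()
clf = svm.SVC().fit(iris.data, iris.target)

# With compress set, everything goes into one file:
joblib.dump(clf, 'clf.pkl', compress=True)
# -> ['clf.pkl'] only, no clf.pkl_NN.npy companions

The benchmark below shows the speed and file-size tradeoff of compression versus plain joblib dumps and pickle: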

import pickle
import numpy as np
from sklearn.externals import joblib

vector = np.arange(0, 10**7)

%timeit joblib.dump(vector, 'vector.pkl')
# 1 loops, best of 3: 818 ms per loop
# file size ~ 80 MB
%timeit vector_load = joblib.load('vector.pkl')
# 10 loops, best of 3: 47.6 ms per loop

# Compressed
%timeit joblib.dump(vector, 'vector.pkl', compress=1)
# 1 loops, best of 3: 1.58 s per loop
# file size ~ 15.1 MB
%timeit vector_load = joblib.load('vector.pkl')
# 1 loops, best of 3: 442 ms per loop

# Pickle
%%timeit
with open('vector.pkl', 'wb') as f:
    pickle.dump(vector, f)
# 1 loops, best of 3: 927 ms per loop
%%timeit                                    
with open('vector.pkl', 'rb') as f:
    vector_load = pickle.load(f)
# 10 loops, best of 3: 94.1 ms per loop
answered Sep 16 '22 by Ibraim Ganiev