Background: I'm just getting started with scikit-learn, and at the bottom of the page I read this comparison of joblib and pickle: <blockquote> it may be more interesting to use joblib’s replacement of pickle (joblib.dump & joblib.load), which is more efficient on big data, but can only pickle to the disk and not to a string </blockquote> I also read the Q&A "Common use-cases for pickle in Python" and wonder whether the community here can explain the differences between joblib and pickle. When should one be used over the other?


What are the different use cases of joblib versus pickle?

1 Answer

joblib is usually significantly faster on large NumPy arrays because it has special handling for the array buffers of the NumPy data structure. To find out about the implementation details, you can have a look at the source code. It can also compress the data on the fly while pickling, using zlib or lz4.
joblib also makes it possible to memory-map the data buffer of an uncompressed joblib-pickled NumPy array when loading it, which makes it possible to share memory between processes.
If you don't pickle large NumPy arrays, then regular pickle can be significantly faster, especially on large collections of small Python objects (e.g. a large dict of str objects), because the pickle module of the standard library is implemented in C while joblib is pure Python.
Since PEP 574 (pickle protocol 5) was merged in Python 3.8, it is now much more efficient (memory-wise and CPU-wise) to pickle large NumPy arrays using the standard library. Large arrays in this context means 4 GB or more.
However, joblib can still be useful with Python 3.8 and later to load objects that have nested NumPy arrays in memory-mapped mode with mmap_mode="r".
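
The points above can be sketched in a few lines. This is only a minimal illustration, assuming joblib and NumPy are installed and Python is 3.8 or later; the dictionary contents, file names, and the zlib compression level are made up for the example and are not recommendations.

```python
import os
import pickle
import tempfile

import joblib
import numpy as np

# Illustrative object: a dict with a nested NumPy array (names are hypothetical).
data = {"weights": np.arange(100_000, dtype=np.float64), "name": "demo"}

tmpdir = tempfile.mkdtemp()
raw_path = os.path.join(tmpdir, "data.joblib")
zip_path = os.path.join(tmpdir, "data_compressed.joblib")

# Uncompressed dump: needed if the arrays should be memory-mapped on load.
joblib.dump(data, raw_path)

# Compressed dump: zlib (level 3 chosen arbitrarily) is applied while writing.
joblib.dump(data, zip_path, compress=("zlib", 3))

# Memory-mapped load: the nested array is backed by the file on disk
# instead of being copied into memory.
mapped = joblib.load(raw_path, mmap_mode="r")
is_memmap = isinstance(mapped["weights"], np.memmap)

# Regular load of the compressed file: the array is read fully into memory.
restored = joblib.load(zip_path)
same_values = np.array_equal(restored["weights"], data["weights"])

# Standard-library alternative: pickle protocol 5 (Python 3.8+).
payload = pickle.dumps(data, protocol=5)
roundtrip = pickle.loads(payload)
```

Note that the memory-mapped load is done on the uncompressed file, since the answer describes memory mapping for uncompressed joblib pickles.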

127

answered Sep 21 '22 08:09

ogrisel



Tags:

python

pickle

scikit-learn

msunbot
