
What are the different use cases of joblib versus pickle?

Background: I'm just getting started with scikit-learn, and read at the bottom of the page about joblib versus pickle:

"it may be more interesting to use joblib’s replacement of pickle (joblib.dump & joblib.load), which is more efficient on big data, but can only pickle to the disk and not to a string"

I read this Q&A on pickle, "Common use-cases for pickle in Python", and wonder whether the community here can share the differences between joblib and pickle. When should one use one over the other?

msunbot asked Sep 27 '12 06:09



1 Answer

  • joblib is usually significantly faster on large numpy arrays because it has special handling for the array buffers of the numpy data structure. For the implementation details, have a look at the source code. It can also compress the data on the fly while pickling, using zlib or lz4.
  • joblib can also memory map the data buffer of an uncompressed joblib-pickled numpy array when loading it, which makes it possible to share memory between processes.
  • If you don't pickle large numpy arrays, regular pickle can be significantly faster, especially on large collections of small Python objects (e.g. a large dict of str objects), because the pickle module of the standard library is implemented in C while joblib is pure Python.
  • Since PEP 574 (pickle protocol 5) was merged in Python 3.8, it is much more efficient (memory-wise and CPU-wise) to pickle large numpy arrays using the standard library. "Large arrays" in this context means 4GB or more.
  • joblib can still be useful with Python 3.8 to load objects that have nested numpy arrays in memory-mapped mode with mmap_mode="r".
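
To make the points above concrete, here is a minimal sketch rather than a benchmark. It assumes joblib and numpy are installed. The `model` dict, the file names, and the compression level 3 are made up for illustration. It dumps an object containing a numpy array with and without compression, reloads the uncompressed file with mmap_mode="r", and uses plain pickle for a collection of small objects.

```python
import os
import pickle
import tempfile

import joblib
import numpy as np

# Hypothetical object: a dict holding a large-ish array plus small metadata.
model = {"weights": np.tile(np.arange(100.0), 10_000), "name": "demo"}

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "model.joblib")
    zipped_path = os.path.join(tmp, "model.joblib.z")

    # Uncompressed dump: needed if the array should be memory mapped later.
    joblib.dump(model, raw_path)
    # Compressed dump (level 3 chosen arbitrarily): smaller file on disk.
    joblib.dump(model, zipped_path, compress=3)

    # Load with the nested array memory mapped read-only instead of copied.
    loaded = joblib.load(raw_path, mmap_mode="r")
    is_memmap = isinstance(loaded["weights"], np.memmap)
    same_values = bool(np.array_equal(loaded["weights"], model["weights"]))

    compressed_is_smaller = (
        os.path.getsize(zipped_path) < os.path.getsize(raw_path)
    )

    # Drop the mapping before the temporary directory is cleaned up.
    del loaded

# Many small Python objects and no arrays: the standard library pickle is
# enough, and it can serialize to an in-memory bytes object.
small = {str(i): str(i) for i in range(1000)}
blob = pickle.dumps(small, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)
```

Whether joblib or pickle is faster for your data is best checked by timing both on a representative object. As with pickle, only load joblib files from sources you trust, since loading can execute arbitrary code.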
ogrisel answered Sep 21 '22 08:09