I am deserializing large numpy arrays (500MB in this example) and I find the results vary by orders of magnitude between approaches. Below are the 3 approaches I've timed.
I'm receiving the data from the multiprocessing.shared_memory
package, so the data comes to me as a memoryview
object. But in these simple examples, I just pre-create a byte array to run the test.
I wonder if there are any mistakes in these approaches, or if there are other techniques I didn't try. Deserialization in Python is a real pickle of a problem if you want to move data fast without holding the GIL just for I/O. A good explanation of why these approaches vary so much would also make a good answer.
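For context on the shared-memory case: since the data arrives as a memoryview over a multiprocessing.shared_memory block, a zero-copy view can be built directly over that buffer instead of deserializing at all. A minimal sketch (the producer/consumer split here is illustrative, not the asker's actual setup):

```python
import numpy as np
from multiprocessing import shared_memory

# Producer side: create a shared block and copy the array into it once.
src = np.arange(16, dtype=np.uint8)
shm = shared_memory.SharedMemory(create=True, size=src.nbytes)
staging = np.ndarray(src.shape, dtype=src.dtype, buffer=shm.buf)
staging[:] = src  # the only copy

# Consumer side: attach by name and wrap shm.buf without copying.
attached = shared_memory.SharedMemory(name=shm.name)
result = np.ndarray(src.shape, dtype=src.dtype, buffer=attached.buf)
checksum = int(result.sum())  # sum of 0..15 -> 120

# Drop the numpy views before closing; close() raises BufferError
# while exported buffer views are still alive.
del staging, result
attached.close()
shm.close()
shm.unlink()
```

The consumer never touches the 500 MB payload byte-by-byte; it only needs the shape and dtype out of band, which is exactly the bookkeeping the second timed approach below requires.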
""" Deserialization speed test """
import numpy as np
import pickle
import time
import io
sz = 524288000
sample = np.random.randint(0, 255, size=sz, dtype=np.uint8) # 500 MB data
serialized_sample = pickle.dumps(sample)
serialized_bytes = sample.tobytes()
serialized_bytesio = io.BytesIO()
np.save(serialized_bytesio, sample, allow_pickle=False)
serialized_bytesio.seek(0)
result = None
print('Deserialize using pickle...')
t0 = time.time()
result = pickle.loads(serialized_sample)
print('Time: {:.10f} sec'.format(time.time() - t0))
print('Deserialize from bytes...')
t0 = time.time()
result = np.ndarray(shape=sz, dtype=np.uint8, buffer=serialized_bytes)
print('Time: {:.10f} sec'.format(time.time() - t0))
print('Deserialize using numpy load from BytesIO...')
t0 = time.time()
result = np.load(serialized_bytesio, allow_pickle=False)
print('Time: {:.10f} sec'.format(time.time() - t0))
Results:
Deserialize using pickle...
Time: 0.2509949207 sec
Deserialize from bytes...
Time: 0.0204288960 sec
Deserialize using numpy load from BytesIO...
Time: 28.9850852489 sec
The second option is the fastest, but notably less elegant because I need to explicitly serialize the shape and dtype information.
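One way to keep the speed of the raw-bytes approach while carrying shape and dtype along with the payload is to prepend a small fixed-layout header. The pack/unpack helpers and the header format below are hypothetical, just a sketch of the idea:

```python
import struct
import numpy as np

def pack(arr: np.ndarray) -> bytes:
    """Prefix the raw bytes with the dtype string and shape (illustrative format)."""
    dt = arr.dtype.str.encode()  # e.g. b'|u1' for uint8
    header = struct.pack('<B', len(dt)) + dt
    header += struct.pack('<B', arr.ndim)
    header += struct.pack(f'<{arr.ndim}q', *arr.shape)
    return header + arr.tobytes()

def unpack(buf: bytes) -> np.ndarray:
    n = buf[0]
    dt = np.dtype(buf[1:1 + n].decode())
    off = 1 + n
    ndim = buf[off]
    off += 1
    shape = struct.unpack_from(f'<{ndim}q', buf, off)
    off += 8 * ndim
    # frombuffer is zero-copy over the payload (result is read-only).
    return np.frombuffer(buf, dtype=dt, offset=off).reshape(shape)

a = np.random.randint(0, 255, size=1000, dtype=np.uint8)
b = unpack(pack(a))
assert b.dtype == a.dtype and b.shape == a.shape and np.array_equal(a, b)
```

This is essentially what np.save's .npy header does, minus the file framing; the difference is that unpack here never copies the payload, whereas np.load from a stream has to read it.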
I found your question useful. I'm also looking for the best numpy serialization, and I confirmed that np.load() performed best until it was beaten by pyarrow in my add-on test below. Arrow is now a very popular data serialization framework for distributed compute (e.g. Spark).
""" Deserialization speed test """
import numpy as np
import pickle
import time
import io
import pyarrow as pa
sz = 524288000
sample = np.random.randint(0, 255, size=sz, dtype=np.uint8) # 500 MB data
pa_buf = pa.serialize(sample).to_buffer()
serialized_sample = pickle.dumps(sample)
serialized_bytes = sample.tobytes()
serialized_bytesio = io.BytesIO()
np.save(serialized_bytesio, sample, allow_pickle=False)
serialized_bytesio.seek(0)
result = None
print('Deserialize using pickle...')
t0 = time.time()
result = pickle.loads(serialized_sample)
print('Time: {:.10f} sec'.format(time.time() - t0))
print('Deserialize from bytes...')
t0 = time.time()
result = np.ndarray(shape=sz, dtype=np.uint8, buffer=serialized_bytes)
print('Time: {:.10f} sec'.format(time.time() - t0))
print('Deserialize using numpy load from BytesIO...')
t0 = time.time()
result = np.load(serialized_bytesio, allow_pickle=False)
print('Time: {:.10f} sec'.format(time.time() - t0))
print('Deserialize pyarrow')
t0 = time.time()
restored_data = pa.deserialize(pa_buf)
print('Time: {:.10f} sec'.format(time.time() - t0))
Results from i3.2xlarge on Databricks Runtime 8.3ML Python 3.8, Numpy 1.19.2, Pyarrow 1.0.1
Deserialize using pickle...
Time: 0.4069395065 sec
Deserialize from bytes...
Time: 0.0281322002 sec
Deserialize using numpy load from BytesIO...
Time: 0.3059172630 sec
Deserialize pyarrow
Time: 0.0031735897 sec
Your BytesIO result was about 100x slower than mine; I don't know why.