I'm using multiprocessing.Queue to pass numpy arrays of float64 between Python processes. This is working fine, but I'm worried it may not be as efficient as it could be.
According to the multiprocessing documentation, objects placed on the Queue will be pickled. Calling pickle on a numpy array results in a text representation of the data, so null bytes get replaced by the string "\x00":
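Roughly, the setup looks something like this (the array size and the worker function are just placeholders for what I'm actually doing):
import multiprocessing
import numpy

def worker(q):
    while True:
        a = q.get()        # each get() hands the worker a fresh copy of the array
        if a is None:
            break
        # ... do something with a ...

if __name__ == '__main__':
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(q,))
    p.start()
    q.put(numpy.zeros(1000000))   # the array is pickled to go through the queue
    q.put(None)
    p.join()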
>>> pickle.dumps(numpy.zeros(10))
"cnumpy.core.multiarray\n_reconstruct\np0\n(cnumpy\nndarray\np1\n(I0\ntp2\nS'b'\np3\ntp4\nRp5\n(I1\n(I10\ntp6\ncnumpy\ndtype\np7\n(S'f8'\np8\nI0\nI1\ntp9\nRp10\n(I3\nS'<'\np11\nNNNI-1\nI-1\nI0\ntp12\nbI00\nS'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\np13\ntp14\nb."
I'm concerned that this means my arrays are being expensively converted into something 4x the original size and then converted back in the other process.
Is there a way to pass the data through the queue in a raw unaltered form?
I know about shared memory, but if that is the correct solution, I'm not sure how to build a queue on top of it.
Thanks!
The issue isn't with numpy, but with pickle's default protocol, which represents the data as ASCII text so that the output is human readable. You can tell pickle to produce compact binary data instead by requesting a newer protocol.
import numpy
import cPickle as pickle

N = 1000
a0 = pickle.dumps(numpy.zeros(N))               # default protocol 0: ASCII text
a1 = pickle.dumps(numpy.zeros(N), protocol=-1)  # -1 selects the highest (binary) protocol
print "a0", len(a0)  # 32155
print "a1", len(a1)  # 8133
Also note that if you want to reduce processor work and time, you should probably use cPickle instead of pickle (though the space savings from the binary protocol apply regardless of which pickle module you use).
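As a rough sketch (the exact numbers will vary with your machine and array size; the variable names are just for illustration), you can compare the two modules with timeit:
import timeit
import numpy
import pickle
import cPickle

a = numpy.zeros(100000)

# Time 100 binary-protocol dumps with each module.
t_pickle = timeit.timeit(lambda: pickle.dumps(a, -1), number=100)
t_cpickle = timeit.timeit(lambda: cPickle.dumps(a, -1), number=100)

print "pickle: %.3f s   cPickle: %.3f s" % (t_pickle, t_cpickle)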
On shared memory:
On the question of shared memory, there are a few things to consider. Shared data typically adds a significant amount of complexity to code: for every line of code that uses that data, you need to worry about whether some other line in another process is simultaneously using it. How hard this is depends on what you're doing. The advantage is that you save the time spent sending the data back and forth. The question that Eelco cites involves a 60 GB array, and for that there's really no choice, it has to be shared. On the other hand, for most reasonably complex code, deciding to share data simply to save a few microseconds or bytes would probably be one of the worst premature optimizations one could make.
Share Large, Read-Only Numpy Array Between Multiprocessing Processes
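If you do go that route, a minimal sketch of "a queue on top of shared memory" looks something like the following: allocate the array once in shared memory and send only small tokens through the Queue. The names (worker, q_in, q_out) and the array size are placeholders, and RawArray provides no locking, so add synchronization if both processes touch the data at the same time.
import multiprocessing
import numpy

def worker(shared_arr, q_in, q_out):
    # Re-wrap the shared buffer as a numpy array; this makes no copy.
    a = numpy.frombuffer(shared_arr, dtype=numpy.float64)
    while True:
        msg = q_in.get()
        if msg is None:
            break
        a *= 2              # the bulk data is modified in place
        q_out.put('done')

if __name__ == '__main__':
    N = 1000000
    # Unsynchronized shared buffer of N doubles, inherited by the child process.
    shared_arr = multiprocessing.RawArray('d', N)
    a = numpy.frombuffer(shared_arr, dtype=numpy.float64)
    a[:] = numpy.arange(N)

    q_in, q_out = multiprocessing.Queue(), multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(shared_arr, q_in, q_out))
    p.start()

    q_in.put('go')          # only a tiny token travels through the queue
    q_out.get()
    q_in.put(None)
    p.join()
    print a[:3]             # the parent sees the worker's in-place changes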
That should cover it all. Pickling incompressible binary data is a pain regardless of the protocol used, so this solution is much to be preferred.