passing numpy arrays through multiprocessing.Queue

I'm using multiprocessing.Queue to pass numpy arrays of float64 between Python processes. This is working fine, but I'm worried it may not be as efficient as it could be.
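For reference, a minimal sketch of that kind of setup (the worker function here is illustrative, not my actual code):

import multiprocessing
import numpy

def worker(q):
    # Receive an array from the parent process; the Queue pickles it
    # on the way in and unpickles it on the way out.
    a = q.get()
    print(a.sum())

if __name__ == '__main__':
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(q,))
    p.start()
    q.put(numpy.zeros(1000, dtype=numpy.float64))
    p.join()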

According to the multiprocessing documentation, objects placed on the Queue are pickled. Calling pickle on a numpy array (with the default protocol) produces an ASCII representation of the data, so null bytes get escaped to the four-character string "\\x00".

>>> pickle.dumps(numpy.zeros(10))
"cnumpy.core.multiarray\n_reconstruct\np0\n(cnumpy\nndarray\np1\n(I0\ntp2\nS'b'\np3\ntp4\nRp5\n(I1\n(I10\ntp6\ncnumpy\ndtype\np7\n(S'f8'\np8\nI0\nI1\ntp9\nRp10\n(I3\nS'<'\np11\nNNNI-1\nI-1\nI0\ntp12\nbI00\nS'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\np13\ntp14\nb."

I'm concerned that this means my arrays are being expensively converted into something 4x the original size and then converted back in the other process.

Is there a way to pass the data through the queue in a raw unaltered form?

I know about shared memory, but if that is the correct solution, I'm not sure how to build a queue on top of it.

Thanks!

asked Jan 28 '14 by E_G

2 Answers

The issue isn't with numpy but with the default settings for how pickle represents data (as ASCII text, so the output is human-readable). You can change pickle's protocol to produce compact binary data instead.

import numpy
import cPickle as pickle   # Python 2: the C implementation, much faster than pickle

N = 1000
a0 = pickle.dumps(numpy.zeros(N))                 # default protocol 0: ASCII text
a1 = pickle.dumps(numpy.zeros(N), protocol=-1)    # -1: highest (binary) protocol

print "a0", len(a0)   # 32155
print "a1", len(a1)   #  8133

Also note that if you want to decrease processor work and time, you should probably use cPickle instead of pickle (but the space savings from the binary protocol apply regardless of which pickle module you use).
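A Python 3 note: cPickle is gone there (the plain pickle module includes the C implementation), and the default pickle protocol is already binary, so the text blow-up from the question doesn't happen by default. The same comparison on Python 3 looks like this:

import pickle
import numpy

N = 1000
text = pickle.dumps(numpy.zeros(N), protocol=0)                        # ASCII, as in the question
binary = pickle.dumps(numpy.zeros(N), protocol=pickle.HIGHEST_PROTOCOL)

print("protocol 0:", len(text))    # several times the raw 8000 bytes
print("binary:    ", len(binary))  # raw 8000 bytes plus a small header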

On shared memory:
On the question of shared memory, there are a few things to consider. Shared data typically adds significant complexity to code: for every line that touches the data, you have to worry about whether some line in another process is simultaneously using it. How hard that is depends on what you're doing. The advantage is that you save the time spent sending the data back and forth. The question Eelco cites involves a 60 GB array, and at that size there's really no choice; it has to be shared. On the other hand, for most reasonably complex code, deciding to share data just to save a few microseconds or bytes would probably be one of the worst premature optimizations one could make.
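That said, if you do go the shared-memory route, one way to get queue-like behavior on top of it (a sketch, not a drop-in Queue replacement; it assumes fixed-size messages, and all the names are illustrative) is to pre-allocate a pool of shared buffers and pass only buffer indices through ordinary Queues, so the Queues handle the synchronization:

import multiprocessing
import numpy

N = 1000    # elements per message (assumed fixed)
POOL = 4    # number of in-flight shared buffers

def producer(bufs, free, ready):
    for i in range(10):
        idx = free.get()                  # claim an empty shared buffer
        a = numpy.frombuffer(bufs[idx])   # float64 view, no copy
        a[:] = i                          # write the payload in place
        ready.put(idx)                    # hand the buffer to the consumer
    ready.put(None)                       # sentinel: no more data

def consumer(bufs, free, ready):
    while True:
        idx = ready.get()
        if idx is None:
            break
        a = numpy.frombuffer(bufs[idx])
        print(a.sum())
        free.put(idx)                     # recycle the buffer

if __name__ == '__main__':
    # RawArray has no lock of its own; the two index queues provide all the
    # synchronization, so a buffer is never read and written at the same time.
    bufs = [multiprocessing.RawArray('d', N) for _ in range(POOL)]
    free = multiprocessing.Queue()
    ready = multiprocessing.Queue()
    for i in range(POOL):
        free.put(i)
    p = multiprocessing.Process(target=producer, args=(bufs, free, ready))
    c = multiprocessing.Process(target=consumer, args=(bufs, free, ready))
    p.start()
    c.start()
    p.join()
    c.join()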

answered Oct 20 '22 by tom10


Share Large, Read-Only Numpy Array Between Multiprocessing Processes

That should cover it all. Pickling incompressible binary data is a pain regardless of the protocol used, so the shared-memory approach described there is much preferable.
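For what it's worth, on Python 3.8+ the standard library's multiprocessing.shared_memory module makes the shared-array approach fairly direct. A minimal sketch (the array and all names are illustrative):

from multiprocessing import Process, shared_memory
import numpy

def worker(name, shape, dtype):
    # Attach to the existing block and view it as an array; no copy is made.
    shm = shared_memory.SharedMemory(name=name)
    a = numpy.ndarray(shape, dtype=dtype, buffer=shm.buf)
    print(a.sum())   # treat as read-only by convention
    shm.close()

if __name__ == '__main__':
    big = numpy.arange(1_000_000, dtype=numpy.float64)
    shm = shared_memory.SharedMemory(create=True, size=big.nbytes)
    a = numpy.ndarray(big.shape, dtype=big.dtype, buffer=shm.buf)
    a[:] = big       # one copy into the shared block
    p = Process(target=worker, args=(shm.name, big.shape, big.dtype))
    p.start()
    p.join()
    shm.close()
    shm.unlink()     # release the block when done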

answered Oct 20 '22 by Eelco Hoogendoorn