In TensorFlow, I want to have a filename queue shared across different workers on different machines, so that each machine can get a subset of the files to train on. I have searched a lot, and it seems that only variables can be placed on a PS task to be shared. Does anyone have an example? Thanks.
TensorFlow supports distributed computing, allowing portions of the graph to be computed on different processes, which may be running on completely different servers. This can also be used to distribute computation to servers with powerful GPUs while other computations are done on servers with more memory, and so on.
The TensorFlow runtime also parallelizes graph execution across many different dimensions: the individual ops have parallel implementations, using multiple cores in a CPU or many threads on a GPU.
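To make that concrete, here is a minimal sketch (not from the question) of how a cluster is described and how an op can be pinned to a particular task; the job layout, host names, and ports are placeholders.

import tensorflow as tf

# Describe the cluster once; every process uses the same spec.
# Host:port values below are placeholders.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Each process starts a server for its own job/task, e.g. worker 0:
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Ops can be pinned to any task in the cluster; this variable lives in the
# parameter-server process, not in the worker that defines it.
with tf.device("/job:ps/task:0"):
    global_step = tf.Variable(0, name="global_step", trainable=False)

# A session connected to any server in the cluster can run the whole graph:
# sess = tf.Session(server.target)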
It is possible to share the same queue across workers by setting the optional shared_name argument when creating the queue. Just as with tf.Variable objects, you can place the queue on any device that can be accessed from the different workers. For example:
with tf.device("/job:ps/task:0"):  # Place the queue on the parameter server.
    q = tf.FIFOQueue(..., shared_name="shared_queue")
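To show how each worker would then use it, here is a hedged sketch; the cluster layout, host address, capacity, and file names are illustrative, not part of the answer. Every worker builds the identical queue node (same device, same shared_name), one task enqueues the filenames, and each dequeue hands a filename to exactly one worker.

import tensorflow as tf

with tf.device("/job:ps/task:0"):  # queue lives in the PS process
    filename_queue = tf.FIFOQueue(
        capacity=100, dtypes=[tf.string], shared_name="shared_queue")

# Built by every worker; because the device and shared_name match, they all
# refer to the same underlying queue.
enqueue_filenames = filename_queue.enqueue_many(
    [["train-0.tfrecords", "train-1.tfrecords", "train-2.tfrecords"]])
next_filename = filename_queue.dequeue()

# Target any server in the cluster (placeholder address).
with tf.Session("grpc://worker0.example.com:2222") as sess:
    sess.run(enqueue_filenames)     # typically only the chief does this
    print(sess.run(next_filename))  # each worker gets a distinct filename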
A few notes:
The value for shared_name must be unique to the particular queue that you are sharing. Unfortunately, the Python API does not currently use scoping or automatic name uniquification to make this easier, so you will have to ensure uniqueness manually.
You do not need to place the queue on a parameter server. One possible configuration would be to set up an additional "input job" (e.g. "/job:input") containing a set of tasks that perform pre-processing and export a shared queue for the workers to use, as in the sketch after this list.
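A rough illustration of that second note follows; the job name, hosts, and capacity are assumptions, not part of the answer.

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "input":  ["input0.example.com:2222"],  # pre-processing tasks
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# The queue is exported by the input job rather than by a parameter server.
with tf.device("/job:input/task:0"):
    work_queue = tf.FIFOQueue(
        capacity=1000, dtypes=[tf.string], shared_name="shared_queue")

# Input tasks enqueue pre-processed work items; workers only dequeue.
next_item = work_queue.dequeue()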