I am using two worker replicas and one parameter server, like this:
--ps_hosts='hosta.com:2222' --worker_hosts='hosta.com:2223,hostb.com:2223'
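For context, the cluster itself is set up from those flags in the standard TF 0.x way, roughly like this (a sketch; the job_name and task_id flag names are assumptions, not copied from my real script):

# Sketch: building the cluster from the two host flags (TF 0.x distributed API).
# FLAGS.job_name and FLAGS.task_id are assumed flag names.
import tensorflow as tf

FLAGS = tf.app.flags.FLAGS

ps_hosts = FLAGS.ps_hosts.split(',')          # ['hosta.com:2222']
worker_hosts = FLAGS.worker_hosts.split(',')  # ['hosta.com:2223', 'hostb.com:2223']

cluster = tf.train.ClusterSpec({'ps': ps_hosts, 'worker': worker_hosts})
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_id)

if FLAGS.job_name == 'ps':
    server.join()  # the parameter server only serves variables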
I wrap my optimizer with tf.train.SyncReplicasOptimizer like this:
opt = tf.train.SyncReplicasOptimizer(
    opt,
    replicas_to_aggregate=2,
    replica_id=FLAGS.task_id,
    total_num_replicas=2,
    variables_to_average=variables_to_average)
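For completeness, the chief worker's side of the training loop follows the usual sync-replicas pattern, roughly like this (a sketch of the standard Supervisor/queue-runner wiring, not my exact code; total_loss and FLAGS.train_dir are assumed names):

train_op = opt.minimize(total_loss, global_step=global_step)

# The chief worker runs the queue runner that dequeues and aggregates the
# per-replica gradients, and seeds the token queue that paces the replicas.
chief_queue_runner = opt.get_chief_queue_runner()
init_tokens_op = opt.get_init_tokens_op()

sv = tf.train.Supervisor(is_chief=(FLAGS.task_id == 0),
                         logdir=FLAGS.train_dir,
                         global_step=global_step)
sess = sv.prepare_or_wait_for_session(server.target)
if FLAGS.task_id == 0:
    sv.start_queue_runners(sess, [chief_queue_runner])
    sess.run(init_tokens_op)
while not sv.should_stop():
    sess.run(train_op)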
From the log I can see that worker0 (hosta.com:2223) runs much faster than worker1 (hostb.com:2223), presumably because worker1 has to talk to the parameter server across machines. It looks like worker0 does not wait for the gradients from worker1; even after I kill the worker1 job, worker0 keeps processing. worker0 also logs many duplicate steps like:
INFO:tensorflow:Worker 0: 2016-04-21 03:24:02.659749: step 29010, loss = 0.40(812.0 examples/sec; 0.315 sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:02.990509: step 29010, loss = 0.59(775.3 examples/sec; 0.330 sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:04.650522: step 29013, loss = 0.56(774.0 examples/sec; 0.331 sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:04.989555: step 29013, loss = 0.47(756.3 examples/sec; 0.338 sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:06.549120: step 29016, loss = 0.49(816.6 examples/sec; 0.313 sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:06.867229: step 29016, loss = 0.48(806.1 examples/sec; 0.318 sec/batch)
So, shouldn't tf.train.SyncReplicasOptimizer block and wait for gradients from all of the replicas_to_aggregate workers?
The tf.train.SyncReplicasOptimizer only requires that it receives gradients from replicas_to_aggregate different steps before aggregating and applying them; it does not require that they come from different processes. Your worker0 appears to be running at least twice as fast as worker1, and is completing two steps before worker1 completes one step.
As you have noticed, this is not an efficient use of distributed resources! I would suggest trying to balance your system so that the parameters are served from one or more machines that have equal bandwidth to the two workers. One possibility would be to add another parameter server, running on hostb.com:2222, so that (approximately) half of the parameters are local to each worker.
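Concretely, you would start a second PS process and pass --ps_hosts='hosta.com:2222,hostb.com:2222' to every task; tf.train.replica_device_setter will then spread the variables across both PS tasks, along the lines of this sketch (assuming the model is built under this device scope):

# Sketch: with two PS tasks, replica_device_setter round-robins variables
# across /job:ps/task:0 and /job:ps/task:1, so each worker has roughly
# half of the parameters on its local machine.
cluster = tf.train.ClusterSpec({
    'ps': ['hosta.com:2222', 'hostb.com:2222'],
    'worker': ['hosta.com:2223', 'hostb.com:2223'],
})

with tf.device(tf.train.replica_device_setter(
        worker_device='/job:worker/task:%d' % FLAGS.task_id,
        cluster=cluster)):
    # Build the model here; ops stay on the local worker, and the
    # variables are assigned to the two PS tasks in turn.
    ...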