 

Distributed TensorFlow: tf.train.SyncReplicasOptimizer does not seem to synchronize

I am using two workers/replicas and one parameter server, like this:

--ps_hosts='hosta.com:2222' --worker_hosts='hosta.com:2223,hostb.com:2223'
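Each process builds the cluster from those flags roughly like this (a simplified sketch; the FLAGS.job_name flag is an assumption, only FLAGS.task_id appears in my real code):

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["hosta.com:2222"],
    "worker": ["hosta.com:2223", "hostb.com:2223"],
})
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,   # "ps" or "worker" (assumed flag)
                         task_index=FLAGS.task_id)
if FLAGS.job_name == "ps":
    server.join()                                   # the parameter server just serves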

I wrap my optimizer with tf.train.SyncReplicasOptimizer like this:

opt = tf.train.SyncReplicasOptimizer(
            opt,                                    # the base optimizer being wrapped
            replicas_to_aggregate=2,                # aggregate 2 gradients per update
            replica_id=FLAGS.task_id,
            total_num_replicas=2,
            variables_to_average=variables_to_average)

From the logs I see that worker0 (hosta.com:2223) is much faster than worker1 (hostb.com:2223), presumably because of cross-machine network communication. It looks like worker0 does not wait for the gradients from worker1. Even after I kill the worker1 job, worker0 keeps processing, and it produces many duplicate log lines like:

INFO:tensorflow:Worker 0: 2016-04-21 03:24:02.659749: step 29010, loss = 0.40(812.0 examples/sec; 0.315  sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:02.990509: step 29010, loss = 0.59(775.3 examples/sec; 0.330  sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:04.650522: step 29013, loss = 0.56(774.0 examples/sec; 0.331  sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:04.989555: step 29013, loss = 0.47(756.3 examples/sec; 0.338  sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:06.549120: step 29016, loss = 0.49(816.6 examples/sec; 0.313  sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:06.867229: step 29016, loss = 0.48(806.1 examples/sec; 0.318  sec/batch)

So, shouldn't tf.train.SyncReplicasOptimizer block and wait for gradients from all of the replicas_to_aggregate workers?

asked by LiuJia


1 Answer

The tf.train.SyncReplicasOptimizer only requires that it receives gradients from replicas_to_aggregate different steps before aggregating and applying them, but does not require that they come from different processes. Your worker0 appears to be running at least twice as fast as worker1, and is completing two steps before worker1 completes one step.
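You can see that the aggregation policy counts gradients rather than distinct workers with a standalone sketch using tf.ConditionalAccumulator (an illustration of the same count-based policy, not the optimizer's exact internals):

import tensorflow as tf

# The accumulator releases an averaged gradient as soon as ANY
# num_required gradients arrive, regardless of who sent them.
acc = tf.ConditionalAccumulator(dtype=tf.float32, shape=())
apply_op = acc.apply_grad(tf.constant(1.0), local_step=0)
take_op = acc.take_grad(num_required=2)   # blocks until 2 gradients are in

with tf.Session() as sess:
    sess.run(apply_op)            # gradient from "worker0", step N
    sess.run(apply_op)            # "worker0" again -- still counts
    print(sess.run(take_op))      # fires without worker1: prints 1.0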

As you have noticed, this is not an efficient use of distributed resources! I would suggest trying to balance your system, so that the parameters are served from one or more machines that have equal bandwidth to the two workers. One possibility would be to add another parameter server, running on hostb.com:2222, so that (approximately) half of the parameters are local to each worker.
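A sketch of that layout (tf.train.replica_device_setter places variables round-robin across the ps tasks; the variable names and shapes here are made up for illustration):

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["hosta.com:2222", "hostb.com:2222"],   # second ps task on hostb
    "worker": ["hosta.com:2223", "hostb.com:2223"],
})

with tf.device(tf.train.replica_device_setter(
        cluster=cluster,
        worker_device="/job:worker/task:%d" % FLAGS.task_id)):
    # Variables created here alternate between ps task 0 and ps task 1,
    # so roughly half of the parameter traffic is local to each worker.
    weights = tf.get_variable("weights", shape=[1024, 1024])
    biases = tf.get_variable("biases", shape=[1024])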

answered by mrry