What is the reason to use a parameter server in distributed TensorFlow learning?

Short version: can't we store the variables on one of the workers and not use parameter servers?

Long version: I want to implement synchronous distributed training of a neural network in TensorFlow, with each worker holding a full copy of the model during training.

I've read the distributed TensorFlow tutorial and the code for distributed ImageNet training, and I still don't understand why we need parameter servers.

I see that they are used to store the values of variables, and that replica_device_setter makes sure variables are evenly distributed across the parameter servers (it probably does more than that; I wasn't able to fully understand the code).

The question is: why don't we use one of the workers to store variables? Will I achieve that if I use

with tf.device('/job:worker/task:0/cpu:0'):

instead of

with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):

for the variables? If that works, is there a downside compared to the solution with parameter servers?
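For concreteness, here is roughly the kind of setup I have in mind (the host names and variable shapes are just placeholders I made up):

import tensorflow as tf

# Hypothetical cluster layout; host names are placeholders.
cluster_spec = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Option A: replica_device_setter keeps ops on the local worker and spreads
# variables over the ps tasks (round-robin by default).
with tf.device(tf.train.replica_device_setter(
        cluster=cluster_spec, worker_device="/job:worker/task:0")):
    w = tf.get_variable("w", shape=[1000, 1000])  # placed on /job:ps/task:0
    b = tf.get_variable("b", shape=[1000])        # placed on /job:ps/task:1

# Option B: pin every variable to a single worker instead of using ps tasks.
with tf.device("/job:worker/task:0/cpu:0"):
    w2 = tf.get_variable("w2", shape=[1000, 1000])
    b2 = tf.get_variable("b2", shape=[1000])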

asked Sep 18 '16 by kolesov93

People also ask

What is a parameter server in deep learning?

A parameter server is a key-value store used for training machine learning models on a cluster. The values are the parameters of a machine-learning model (e.g., a neural network). The keys index the model parameters. For example, in a movie recommendation system, there may be one key per user and one key per movie.
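As a toy illustration of that key-value view (this class is made up for the example and is not part of any framework):

import numpy as np

# Toy parameter server: a key-value store where workers pull current values
# and push gradients, and the server applies the update.
class ToyParameterServer:
    def __init__(self, params):
        self.params = dict(params)

    def pull(self, key):
        return self.params[key]              # worker fetches the current value

    def push(self, key, grad, lr=0.1):
        self.params[key] -= lr * grad        # server applies the update

# One key per user and one key per movie, as in the recommender example.
ps = ToyParameterServer({"user:42": np.zeros(8), "movie:7": np.zeros(8)})
vec = ps.pull("user:42")
ps.push("user:42", grad=np.ones(8))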

What is the advantage of using distributed training in TensorFlow?

It can train large models with millions or billions of parameters (e.g., GPT-3, GPT-2, BERT), offers potentially low latency across the workers, and has good TensorFlow community support.

Which TensorFlow API distributes training across multiple GPUs, multiple machines, or TPUs?

tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs. Using this API, you can distribute your existing models and training code with minimal code changes.
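A minimal sketch of what that looks like with MirroredStrategy in TF 2.x (the tiny model here is only a placeholder):

import tensorflow as tf

# MirroredStrategy replicates the model on all local GPUs and all-reduces
# gradients between the replicas.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():                       # variables created here are mirrored
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(dataset) then runs the replicated training step on every replica.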

Which Pythonic, flexible library distributes a workflow across multiple GPUs?

Dask is a flexible library for parallel computing in Python which makes scaling out your workflow smooth and simple.
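A minimal Dask sketch (CPU workers by default; GPU scheduling comes from extras such as dask-cuda):

import dask.array as da

# Build a chunked array and compute a reduction; the task graph is lazy and
# the chunks are scheduled in parallel across the available workers.
x = da.random.random((20000, 20000), chunks=(2000, 2000))
result = (x + x.T).mean(axis=0)
print(result.compute())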


1 Answer

Using a parameter server gives you better network utilization and lets you scale your model to more machines.

A concrete example: suppose you have 250M parameters (about 1 GB of gradients at 4 bytes each), it takes 1 second to compute the gradient on each worker, and there are 10 workers. Each worker would then have to send 1 GB to, and receive 1 GB from, each of the 9 other workers every second, which requires 72 Gbps of full-duplex network capacity per worker. That is not practical.

More realistically, you might have 10 Gbps of network capacity per worker. You avoid this bottleneck by splitting the parameters over, say, 8 parameter-server machines: each worker exchanges only 1/8th of the parameters with each parameter server, so it sends about 1 GB per second in total instead of 9 GB, as the back-of-the-envelope check below shows.
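A quick check of those numbers (assuming 4-byte floats):

params = 250e6
grad_bytes = params * 4                  # ~1 GB of gradients per step
workers = 10
step_seconds = 1.0

# All-to-all between workers: send to and receive from the 9 other workers.
peer_gbps = grad_bytes * (workers - 1) * 8 / 1e9 / step_seconds
print(peer_gbps)                         # 72.0 Gbps each way per worker

# Sharded over 8 parameter servers: each worker exchanges its 1 GB only once
# per step, 1/8 of it with each ps task.
ps_tasks = 8
total_gbps = grad_bytes * 8 / 1e9 / step_seconds
print(total_gbps, total_gbps / ps_tasks) # 8.0 Gbps total, 1.0 Gbps per ps link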

answered Sep 28 '22 by Yaroslav Bulatov