
What is the difference between TensorFlowOnSpark and the default distributed TensorFlow 1.0?

I am trying to install TensorFlowOnSpark on our server. My boss suggested it because he thought it would be easy to use. I have also read about the default distributed TensorFlow described on the TensorFlow website. Can anyone explain the difference between these two approaches to distribution? Does Spark automatically assign the parameter server and the workers?

Thanks in advance.

Jeff Wang asked May 23 '17 01:05





1 Answer

I finally installed TensorFlowOnSpark (TFOS) on the server and compared it with the default distributed TensorFlow (TF). My conclusions are:

Pros:

  1. TFOS is more automatic. I don't need to define which node in the cluster acts as the parameter server (PS) node, and I don't need to upload the same code to every node.
  2. I don't need to type a command on each node to start the training.
  3. Only small code changes are needed to run an existing TF program on TFOS.
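To illustrate the manual work that the default distributed TF involves, here is a minimal sketch in plain Python (it does not import TensorFlow). It builds the kind of job-to-address mapping that `tf.train.ClusterSpec` accepts in TF 1.x and prints the command you would have to run on each node. The host names, the port, the script name `train.py`, and the flag names `--ps_hosts`, `--worker_hosts`, `--job_name`, and `--task_index` are assumptions for illustration; in a real script they would be whatever flags your own training code defines.

```python
def build_cluster(ps_hosts, worker_hosts):
    """Return a dict in the shape accepted by tf.train.ClusterSpec in TF 1.x."""
    return {"ps": list(ps_hosts), "worker": list(worker_hosts)}


def launch_commands(cluster, script="train.py"):
    """Return (host, command) pairs, one per task, to be run by hand on each node.

    `script` and the flag names are hypothetical; they stand in for whatever
    your own training script parses.
    """
    ps_arg = ",".join(cluster["ps"])
    worker_arg = ",".join(cluster["worker"])
    commands = []
    for job_name in ("ps", "worker"):
        for task_index, address in enumerate(cluster[job_name]):
            host = address.split(":")[0]
            cmd = (
                "python {script} --ps_hosts={ps} --worker_hosts={workers} "
                "--job_name={job} --task_index={idx}"
            ).format(script=script, ps=ps_arg, workers=worker_arg,
                     job=job_name, idx=task_index)
            commands.append((host, cmd))
    return commands


if __name__ == "__main__":
    # Hypothetical three-node cluster: one PS node and two workers.
    cluster = build_cluster(["node0:2222"], ["node1:2222", "node2:2222"])
    for host, cmd in launch_commands(cluster):
        print("[{}] {}".format(host, cmd))
```

With the default distributed TF, every line printed above is a command someone has to run on the matching node, after copying the script there. As far as I can tell, TFOS takes over this assignment and launching step.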

Cons:

  1. Sometimes two worker nodes are automatically assigned to the same GPU (I am using a K80, which has two GPU cores), which causes an out-of-memory error.
  2. You need to enter a long list of configuration options on the command line before running.
  3. You cannot specify which node will be the PS node.
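For the first con, one general mitigation (not something TFOS does for you, and not tested on my setup) is to pin each worker process to its own GPU through the `CUDA_VISIBLE_DEVICES` environment variable before TensorFlow is imported. The sketch below assumes you know the list of worker addresses and each worker's task index; the helper names `pick_gpu` and `pin_gpu` and the host names are hypothetical.

```python
import os


def pick_gpu(worker_hosts, task_index, gpus_per_host):
    """Pick a GPU index that no earlier worker on the same host has taken.

    Workers are ranked by how many earlier entries in `worker_hosts`
    share their host name, and that rank is used as the GPU index.
    """
    my_host = worker_hosts[task_index].split(":")[0]
    local_rank = sum(
        1 for address in worker_hosts[:task_index]
        if address.split(":")[0] == my_host
    )
    if local_rank >= gpus_per_host:
        raise ValueError(
            "host {} has more workers than GPUs ({})".format(my_host, gpus_per_host)
        )
    return local_rank


def pin_gpu(gpu_index, env=os.environ):
    """Expose only one GPU to this process.

    This must run before TensorFlow initializes CUDA, so call it
    before importing tensorflow.
    """
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)


if __name__ == "__main__":
    # Hypothetical layout: two workers share node1 (one K80), one runs on node2.
    workers = ["node1:2222", "node1:2223", "node2:2222"]
    for index in range(len(workers)):
        print(workers[index], "-> GPU", pick_gpu(workers, index, gpus_per_host=2))
```

In a Spark job the worker-to-host layout may not be known in advance, so this only applies where you can determine it before the training code starts.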

If I am wrong somewhere, please correct me.

Jeff Wang answered Sep 28 '22 12:09