MirroredStrategy without NCCL

Tags:

tensorflow

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 x64
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.8.0
  • Python version: 3.6
  • Bazel version (if compiling from source): -
  • GCC/Compiler version (if compiling from source): -
  • CUDA/cuDNN version: 9.0
  • GPU model and memory: 3.5
  • Exact command to reproduce: simple_tfkeras_example.py

I would like to use MirroredStrategy to train on multiple GPUs on the same machine. I tried one of the examples: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/distribute/python/examples/simple_tfkeras_example.py

The result is: ValueError: Op type not registered 'NcclAllReduce' in binary running on RAID. Make sure the Op and Kernel are registered in the binary running in this process. while building NodeDef 'NcclAllReduce'

I am using Windows, so NCCL is not available. Is it possible to force TensorFlow not to use this library?

asked Jun 05 '18 by Sanyo

People also ask

What is mirrored strategy?

MirroredStrategy is a method that you can use to perform synchronous distributed training across multiple GPUs. Using this method, you create replicas of your model's variables, which are mirrored across your GPUs.
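
For a concrete picture, here is a minimal sketch (TF 2.x; MirroredStrategy falls back to a single CPU replica if no GPU is visible):

import tensorflow as tf

# Build a MirroredStrategy over all visible GPUs.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created here get one synchronized copy per replica.
    v = tf.Variable(1.0)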

Does TensorFlow use both GPUs?

tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs. Using this API, you can distribute your existing models and training code with minimal code changes.

What is with strategy Scope ()?

Use tf.distribute.Strategy.scope to specify that a strategy should be used when building and executing your model. (This puts you in the "cross-replica context" for this strategy, which means the strategy is put in control of things like variable placement.)
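
A small sketch (TF 2.x) of what the scope changes; the printed class names are indicative and may differ between versions:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    mirrored = tf.Variable(0.0)   # replicated across devices by the strategy

outside = tf.Variable(0.0)        # ordinary variable with normal placement

print(type(mirrored).__name__)    # e.g. MirroredVariable
print(type(outside).__name__)     # e.g. ResourceVariable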


1 Answer

There are some binaries for NCCL on Windows, but they can be quite annoying to deal with.

As an alternative, TensorFlow gives you three other cross-device reduction options in MirroredStrategy that work natively on Windows: Hierarchical Copy, Reduce to First GPU, and Reduce to CPU. What you are most likely looking for is Hierarchical Copy, but you can test each of them to see which gives you the best result.

If you are using a TensorFlow version older than 2.0, use tf.contrib.distribute:

# Hierarchical Copy (number_of_gpus is the number of GPUs to use)
cross_tower_ops = tf.contrib.distribute.AllReduceCrossTowerOps(
    'hierarchical_copy', num_packs=number_of_gpus)
strategy = tf.contrib.distribute.MirroredStrategy(cross_tower_ops=cross_tower_ops)

# Reduce to First GPU
cross_tower_ops = tf.contrib.distribute.ReductionToOneDeviceCrossTowerOps()
strategy = tf.contrib.distribute.MirroredStrategy(cross_tower_ops=cross_tower_ops)

# Reduce to CPU
cross_tower_ops = tf.contrib.distribute.ReductionToOneDeviceCrossTowerOps(
    reduce_to_device="/device:CPU:0")
strategy = tf.contrib.distribute.MirroredStrategy(cross_tower_ops=cross_tower_ops)
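
To actually train with one of these strategies in 1.x, one option was the Estimator API, which accepts a strategy through RunConfig; this is a hedged sketch, and model_fn / input_fn are placeholders for your own functions:

config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
estimator.train(input_fn=input_fn)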

After 2.0, you only need to use tf.distribute! Here is an example setting up an Xception model with 2 GPUs:

import tensorflow as tf
from tensorflow.keras.applications import Xception

# Mirror the model across two GPUs, using hierarchical copy instead of NCCL.
strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1"],
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
with strategy.scope():
    # number_of_classes is the number of classes in your dataset.
    parallel_model = Xception(weights=None,
                              input_shape=(299, 299, 3),
                              classes=number_of_classes)
    parallel_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
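
To sanity-check the setup, you could fit on a small batch of random data (shapes match the input_shape above; number_of_classes must already be defined):

import numpy as np

x = np.random.random((16, 299, 299, 3)).astype("float32")
y = tf.keras.utils.to_categorical(
    np.random.randint(number_of_classes, size=(16,)), number_of_classes)
parallel_model.fit(x, y, epochs=1, batch_size=8)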
answered Nov 15 '22 by Austin