I have an 8-GPU cluster, and when I run a piece of TensorFlow code from Kaggle (pasted below), it uses only a single GPU instead of all 8. I confirmed this using nvidia-smi.
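# Imports inferred from the names used below (not part of the original post);
# assumes standalone Keras on TensorFlow 1.x
import os
import random
import sys
import warnings
import cv2
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from keras import backend as K
from keras import optimizers
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.layers import Conv2D, Conv2DTranspose, Input, Lambda, MaxPooling2D, concatenate
from keras.models import Model
from skimage.io import imread, imshow
from skimage.transform import resize
from tqdm import tqdm
# Set some parameters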
IMG_WIDTH = 256
IMG_HEIGHT = 256
IMG_CHANNELS = 3
TRAIN_IM = './train_im/'
TRAIN_MASK = './train_mask/'
TEST_PATH = './test/'
warnings.filterwarnings('ignore', category=UserWarning, module='skimage')
num_training = len(os.listdir(TRAIN_IM))
num_test = len(os.listdir(TEST_PATH))
# Get and resize train images
X_train = np.zeros((num_training, IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS), dtype=np.uint8)
Y_train = np.zeros((num_training, IMG_HEIGHT, IMG_WIDTH, 1), dtype=bool)  # np.bool is unavailable in some NumPy releases
print('Getting and resizing train images and masks ... ')
sys.stdout.flush()
#load training images
for count, filename in tqdm(enumerate(os.listdir(TRAIN_IM)), total=num_training):
    img = imread(os.path.join(TRAIN_IM, filename))[:,:,:IMG_CHANNELS]
    img = resize(img, (IMG_HEIGHT, IMG_WIDTH), mode='constant', preserve_range=True)
    X_train[count] = img
    name, ext = os.path.splitext(filename)
    mask_name = name + '_mask' + ext
    mask = cv2.imread(os.path.join(TRAIN_MASK, mask_name))[:,:,:1]
    mask = resize(mask, (IMG_HEIGHT, IMG_WIDTH))
    Y_train[count] = mask
# Check if training data looks all right
ix = random.randint(0, num_training-1)
print(ix)
imshow(X_train[ix])
plt.show()
imshow(np.squeeze(Y_train[ix]))
plt.show()
# Define IoU metric
def mean_iou(y_true, y_pred):
    prec = []
    for t in np.arange(0.5, 1.0, 0.05):
        y_pred_ = tf.to_int32(y_pred > t)
        score, up_opt = tf.metrics.mean_iou(y_true, y_pred_, 2)
        K.get_session().run(tf.local_variables_initializer())
        with tf.control_dependencies([up_opt]):
            score = tf.identity(score)
        prec.append(score)
    return K.mean(K.stack(prec), axis=0)
# Build U-Net model
inputs = Input((IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
s = Lambda(lambda x: x / 255) (inputs)
width = 64
c1 = Conv2D(width, (3, 3), activation='relu', padding='same') (s)
c1 = Conv2D(width, (3, 3), activation='relu', padding='same') (c1)
p1 = MaxPooling2D((2, 2)) (c1)
c2 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (p1)
c2 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (c2)
p2 = MaxPooling2D((2, 2)) (c2)
c3 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (p2)
c3 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (c3)
p3 = MaxPooling2D((2, 2)) (c3)
c4 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (p3)
c4 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (c4)
p4 = MaxPooling2D(pool_size=(2, 2)) (c4)
c5 = Conv2D(width*16, (3, 3), activation='relu', padding='same') (p4)
c5 = Conv2D(width*16, (3, 3), activation='relu', padding='same') (c5)
u6 = Conv2DTranspose(width*8, (2, 2), strides=(2, 2), padding='same') (c5)
u6 = concatenate([u6, c4])
c6 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (u6)
c6 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (c6)
u7 = Conv2DTranspose(width*4, (2, 2), strides=(2, 2), padding='same') (c6)
u7 = concatenate([u7, c3])
c7 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (u7)
c7 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (c7)
u8 = Conv2DTranspose(width*2, (2, 2), strides=(2, 2), padding='same') (c7)
u8 = concatenate([u8, c2])
c8 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (u8)
c8 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (c8)
u9 = Conv2DTranspose(width, (2, 2), strides=(2, 2), padding='same') (c8)
u9 = concatenate([u9, c1], axis=3)
c9 = Conv2D(width, (3, 3), activation='relu', padding='same') (u9)
c9 = Conv2D(width, (3, 3), activation='relu', padding='same') (c9)
outputs = Conv2D(1, (1, 1), activation='sigmoid') (c9)
model = Model(inputs=[inputs], outputs=[outputs])
sgd = optimizers.SGD(lr=0.03, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd, loss='binary_crossentropy', metrics=[mean_iou])
model.summary()
    
# Fit model
earlystopper = EarlyStopping(patience=20, verbose=1)
checkpointer = ModelCheckpoint('nuclei_only.h5', verbose=1, save_best_only=True)
results = model.fit(X_train, Y_train, validation_split=0.05, batch_size = 32, verbose=1, epochs=100, 
                callbacks=[earlystopper, checkpointer])
I would like to use MXNet or some other method to run this code on all available GPUs, but I'm not sure how to do this. All the resources I found only show how to do it on the MNIST dataset. I have my own dataset that I read in differently, so I'm not sure how to amend the code.
TL;DR: Create your model inside a tf.distribute.MirroredStrategy() scope, like this:
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    [...create model as you would otherwise...]
If you do not specify any arguments, tf.distribute.MirroredStrategy() will use all available GPUs. You can also restrict it to specific devices, like this: strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"]).
Refer to the Distributed training with TensorFlow guide for implementation details and other strategies.
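Applied to a Keras model fed from NumPy arrays, as in the question, this might look roughly like the sketch below. It assumes TensorFlow 2.x with tf.keras. The tiny two-layer network (build_small_model is a hypothetical name) and the zero-filled arrays are placeholders for the U-Net and the real images. Building and compiling happen inside the scope; fit() is called as usual, and the batch_size passed to it is the global batch, which is split across the replicas.

```python
import numpy as np
import tensorflow as tf


def build_small_model(height=64, width=64, channels=3):
    """Hypothetical stand-in for the U-Net from the question, kept tiny for brevity."""
    inputs = tf.keras.Input((height, width, channels))
    x = tf.keras.layers.Lambda(lambda t: t / 255)(inputs)
    x = tf.keras.layers.Conv2D(8, (3, 3), activation="relu", padding="same")(x)
    outputs = tf.keras.layers.Conv2D(1, (1, 1), activation="sigmoid")(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs)


# With no arguments, the strategy picks up every GPU TensorFlow can see.
# On a machine without GPUs it runs with a single replica on the CPU.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Variables (model weights, optimizer state) must be created inside the scope.
with strategy.scope():
    model = build_small_model()
    sgd = tf.keras.optimizers.SGD(learning_rate=0.03, momentum=0.9, nesterov=True)
    model.compile(optimizer=sgd, loss="binary_crossentropy")

# Placeholder data standing in for X_train / Y_train from the question.
X_train = np.zeros((16, 64, 64, 3), dtype=np.float32)
Y_train = np.zeros((16, 64, 64, 1), dtype=np.float32)

# Scale the global batch with the replica count so each GPU still gets 4 samples.
per_replica_batch = 4
global_batch = per_replica_batch * strategy.num_replicas_in_sync
model.fit(X_train, Y_train, batch_size=global_batch, epochs=1, verbose=0)
```

The data loading code does not need to change: the arrays are built on the host exactly as before, and the strategy takes care of splitting each batch across the GPUs.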
Earlier answer (now outdated: deprecated and removed as of April 1, 2020):
Use multi_gpu_model() from Keras.
TensorFlow 2.0 now has the tf.distribute module, "a library for running a computation across multiple devices". It builds on the concept of "distribution strategies": you specify a strategy and then use it as a scope. TensorFlow then splits the input, parallelizes the calculations, and joins the outputs for you, largely transparently. Backpropagation is handled the same way. Since all of this processing happens behind the scenes, it is worth familiarizing yourself with the available strategies and their parameters, as they can affect your training speed a lot. For example, if you want variables to reside on the CPU, use tf.distribute.experimental.CentralStorageStrategy(). Refer to the Distributed training with TensorFlow guide for more info.
Earlier answer (now outdated, leaving it here for reference):
From the TensorFlow guide:
If you have more than one GPU in your system, the GPU with the lowest ID will be selected by default.
If you want to use multiple GPUs, you unfortunately have to specify manually which tensors to put on each GPU, like this:
with tf.device('/device:GPU:2'):
More info in the TensorFlow guide Using Multiple GPUs.
In terms of how to distribute your network over multiple GPUs, there are two main ways of doing it.
You distribute your network layer-wise over the GPUs. This is easier to implement but will not yield much performance benefit, because the GPUs wait for each other to complete their operations.
You create separate copies of your network, called "towers", one on each GPU. When you feed the eightfold network, you break up your input batch into 8 parts and distribute them, one per tower. Each tower runs the forward and backward pass on its part, the gradients are then averaged across the towers, and the shared weights are updated once. This results in an almost-linear speedup with the number of GPUs. It is much more difficult to implement, however, because you also have to deal with complexities related to batch normalization, and it is very advisable to make sure you randomize your batches properly. There is a nice tutorial here. You should also review the Inception V3 code referenced there for ideas on how to structure such a thing, especially _tower_loss(), _average_gradients(), and the part of train() starting with for i in range(FLAGS.num_gpus):.
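To make the tower idea concrete, here is a minimal pure-Python sketch, not the Inception V3 code. It uses a one-parameter model y = w * x with a mean-squared-error loss. The "towers" are plain loop iterations rather than GPUs, and the function names (shard, grad_mse, tower_step) are made up for illustration. It shows that averaging the per-tower gradients, weighted by shard size, gives the same update as computing the gradient on the whole batch.

```python
def shard(batch, num_towers):
    """Split a batch into num_towers contiguous parts of near-equal size."""
    if len(batch) < num_towers:
        raise ValueError("need at least one sample per tower")
    size, extra = divmod(len(batch), num_towers)
    parts, start = [], 0
    for i in range(num_towers):
        end = start + size + (1 if i < extra else 0)
        parts.append(batch[start:end])
        start = end
    return parts


def grad_mse(w, samples):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)


def tower_step(w, batch, num_towers, lr):
    """One data-parallel SGD step: per-tower gradients, then a single shared update."""
    shards = shard(batch, num_towers)
    # On real hardware each of these would be computed on its own GPU in parallel.
    grads = [grad_mse(w, s) for s in shards]
    # Weight each tower by its shard size so unequal shards still match the full batch.
    avg_grad = sum(g * len(s) for g, s in zip(grads, shards)) / len(batch)
    return w - lr * avg_grad


# Toy data generated from y = 3 * x, split over 8 towers as in the 8-GPU case.
batch = [(float(x), 3.0 * x) for x in range(1, 17)]
w_parallel = tower_step(0.0, batch, num_towers=8, lr=0.001)
w_single = 0.0 - 0.001 * grad_mse(0.0, batch)
print(w_parallel, w_single)
```

The two printed values agree up to floating-point rounding, which is why this scheme can speed up training without changing what is being optimized. The batch normalization caveat arises because such layers compute statistics per tower rather than over the full batch.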
In case you want to give Keras a try, it has now simplified multi-GPU training significantly with multi_gpu_model(), which can do all the heavy lifting for you.