 

Run Identical model on multiple GPUs, but send different user data to each GPU

Has anyone had any success with efficient data parallelism, where you send the identical model definition to multiple GPUs, but send different user data to each GPU?

It looks like dist-keras might be promising, but I would love to hear feedback on any approaches taken along these lines.

We have user behavioral data: 100k users, 200 fields (one-hot vectors), 30,000 records per user. We built an RNN, using Keras on top of TensorFlow, to predict the next action (out of 20+ possible actions) taken for a single user. It takes about 30 minutes to train on 1 GPU. (My box has 8 GPUs.) Now we would like to build models for all 100k users.

We were able to perform data parallelism using a multi-GPU approach for a single user's data.

But since the model takes 30 minutes per user, and there are 100k users, we want to partition the data by user and run the same model for every user's data in a distributed way using a cluster, generating a model output for each user.

I am currently using Keras 2.1.x with TensorFlow 1.4.
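Since each user's model is independent, one simple pattern (a sketch, not a full solution) is to shard the user list across the 8 GPUs and run one training process per GPU, pinning each process to its device with `CUDA_VISIBLE_DEVICES` before TensorFlow is imported. The `train_user_model` call mentioned in the comment below is a hypothetical stand-in for your per-user Keras training routine:

```python
import multiprocessing as mp
import os

NUM_GPUS = 8  # the box described above

def shard_users(user_ids, num_shards):
    """Split the user list into num_shards roughly equal groups."""
    return [user_ids[i::num_shards] for i in range(num_shards)]

def worker(gpu_id, users):
    # Pin this process to a single GPU before any TF/Keras import,
    # so each worker trains its shard of users on its own device.
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    for user in users:
        # train_user_model(user) would build and fit the per-user RNN here
        pass

if __name__ == '__main__':
    users = ['user_%d' % i for i in range(16)]
    shards = shard_users(users, NUM_GPUS)
    procs = [mp.Process(target=worker, args=(gpu, shard))
             for gpu, shard in enumerate(shards)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

The same sharding generalizes from one box to a cluster: each machine takes a slice of the user list, and each GPU on the machine takes a sub-slice.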

Balaji asked Jan 08 '18 07:01


People also ask

How can I train a Keras model on multiple GPUs on a single machine?

There are two ways to run a single model on multiple GPUs, data parallelism and device parallelism. In most cases, what you need is most likely data parallelism. Data parallelism consists of replicating the target model once on each device and using each replica to process a different fraction of the input data.

What is model parallelism?

Model parallelism is a distributed training method in which the deep learning model is partitioned across multiple devices, within or across instances.

What is multi GPU model?

PyTorch is an open-source scientific computing framework based on Python. You can use it to train machine learning models using tensor computations and GPUs. The framework supports distributed training through the torch.distributed package.

Does TensorFlow automatically use multiple GPUs?

If you have more than one GPU, the GPU with the lowest ID will be selected by default. However, TensorFlow does not place operations into multiple GPUs automatically. To override the device placement to use multiple GPUs, we manually specify the device that a computation node should run on.


1 Answer

This is not exactly what you are describing; however, one approach that might work is to take slices of each batch and train them on different GPUs separately, by wrapping the model in a second model that does this automatically.

Say we want to parallelize the model and split its batches among the hardware during training.

import tensorflow as tf
from keras.layers import Lambda, concatenate
from keras.models import Model

def make_parallel(model, gpu_count):
    """
    Make a parallelized model from the input model on the
    given GPU count that splits the input batch amongst the
    hardware.

    :param model: The model you want to make parallel
    :param gpu_count: The GPU count
    :return: The parallelized model
    """
    def get_slice(data, idx, parts):  # take a slice of the batch
        shape = tf.shape(data)
        size = tf.concat([shape[:1] // parts, shape[1:]], axis=0)
        stride = tf.concat([shape[:1] // parts, shape[1:] * 0], axis=0)
        start = stride * idx
        return tf.slice(data, start, size)

    outputs_all = [[] for i in range(len(model.outputs))]

    # Place a copy of the model on each GPU, each getting a slice of the batch
    for i in range(gpu_count):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('tower_%d' % i):
                inputs = []
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_n = Lambda(get_slice, output_shape=input_shape,
                                     arguments={'idx': i, 'parts': gpu_count})(x)
                    inputs.append(slice_n)

                outputs = model(inputs)

                if not isinstance(outputs, list):
                    outputs = [outputs]

                # Save all outputs so they can be concatenated afterwards
                for l in range(len(outputs)):
                    outputs_all[l].append(outputs[l])

    # Merge the per-tower outputs on the CPU (Keras 2 API: the old
    # merge(mode='concat') and Model(input=, output=) no longer exist)
    with tf.device('/cpu:0'):
        merged = [concatenate(output, axis=0) for output in outputs_all]
        return Model(inputs=model.inputs, outputs=merged)
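To see what `get_slice` computes without the TF ops, here is the same arithmetic as a plain-Python sketch: for a batch of size B split into `parts`, tower `idx` receives the contiguous chunk of `B // parts` rows starting at `idx * (B // parts)`:

```python
def slice_bounds(batch_size, idx, parts):
    # Mirrors get_slice above: each tower takes a contiguous
    # batch_size // parts chunk starting at idx * (batch_size // parts).
    size = batch_size // parts
    start = idx * size
    return start, start + size

# A batch of 8 samples split across 4 towers:
bounds = [slice_bounds(8, i, 4) for i in range(4)]
# -> [(0, 2), (2, 4), (4, 6), (6, 8)]
```

Note that because of the integer division, any remainder rows (when the batch size is not divisible by `gpu_count`) are dropped, so batch sizes divisible by the GPU count are safest.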

Can you report back speed results when training on this model?

modesitt answered Oct 04 '22 01:10