Why is this TensorFlow training taking so long?

I'm learning DRL with the book Deep Reinforcement Learning in Action. In chapter 3, they present the simple game Gridworld (instructions here, in the rules section) with the corresponding code in PyTorch.

I've experimented with the code, and it takes less than 3 minutes to train the network to an 89% win rate (it won 89 of 100 games after training).

[Plot: training loss with PyTorch]

As an exercise, I have migrated the code to TensorFlow. All the code is here.

The problem is that my TensorFlow port takes nearly 2 hours to train the network to an 84% win rate. Both versions train on the CPU only (I don't have a GPU).

[Plot: training loss with TensorFlow]

The training loss figures look correct, and so does the win rate (we have to take into account that the game is random and can produce impossible states). The problem is the performance of the overall process.

I must be doing something terribly wrong, but what?

The main differences are in the training loop. In PyTorch it is:

        loss_fn = torch.nn.MSELoss()
        learning_rate = 1e-3
        optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
        ....
        Q1 = model(state1_batch)      # Q-values for the current states
        with torch.no_grad():
            Q2 = model2(state2_batch) #B target-network Q-values, no gradients

        Y = reward_batch + gamma * ((1-done_batch) * torch.max(Q2,dim=1)[0])   # TD target
        X = Q1.gather(dim=1,index=action_batch.long().unsqueeze(dim=1)).squeeze()  # Q-values of the actions taken
        loss = loss_fn(X, Y.detach())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

and in the tensorflow version:

        loss_fn = tf.keras.losses.MSE
        learning_rate = 1e-3
        optimizer = tf.keras.optimizers.Adam(learning_rate)
        ...
        Q2 = model2(state2_batch) #B target-network Q-values
        with tf.GradientTape() as tape:
            Q1 = model(state1_batch)  # Q-values for the current states
            Y = reward_batch + gamma * ((1-done_batch) * tf.math.reduce_max(Q2, axis=1))  # TD target
            X = [Q1[i][action_batch[i]] for i in range(len(action_batch))]  # Q-values of the actions taken
            loss = loss_fn(X, Y)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

Why is the training taking so long?

Asked May 04 '21 by Ivan


People also ask

How long does TensorFlow take to train?

Training usually takes between 2 and 8 hours, depending on the number of files and queued models for training.

How do I speed up TensorFlow training?

To optimize training speed, you want your GPUs to be running at 100% utilization. nvidia-smi is handy for checking that your process is actually running on the GPU, but when it comes to ongoing GPU monitoring, there are smarter tools out there.
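
Before trusting utilization numbers, it can be worth confirming that TensorFlow sees a GPU at all. A minimal check, using the standard tf.config API (nothing here is specific to any particular setup):

import tensorflow as tf

# An empty list means TensorFlow will silently fall back to the CPU.
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices('GPU'))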

Why is TensorFlow training so slow?

Most slowness is caused by an unoptimized read pipeline: much of the time the network is simply waiting on reads from disk rather than processing data. For this reason TensorFlow created a dedicated file format, TFRecord, to lower disk read time. For the same reason, the data-reading part of the training code should run on the CPU.
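
As a rough sketch of the TFRecord idea (the file name and feature layout below are invented for the example):

import tensorflow as tf

# Write one serialized example to a TFRecord file.
with tf.io.TFRecordWriter("data.tfrecord") as writer:
    example = tf.train.Example(features=tf.train.Features(feature={
        "x": tf.train.Feature(float_list=tf.train.FloatList(value=[1.0, 2.0])),
    }))
    writer.write(example.SerializeToString())

# Read it back as a tf.data pipeline; TFRecordDataset streams from disk,
# so the accelerator is not left waiting on slow Python file I/O.
def parse(record):
    return tf.io.parse_single_example(
        record, {"x": tf.io.FixedLenFeature([2], tf.float32)})

dataset = tf.data.TFRecordDataset("data.tfrecord").map(parse).batch(32)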

Is TensorFlow a good library for deep learning?

TensorFlow is a really great library for deep learning. It has good support for GPU acceleration, but it currently only supports CUDA, so bad news if you have an AMD card: you have to rely on the CPU to run models, and CPUs aren't made for that kind of parallel load, so it will be slow.

Why is TensorFlow faster than Caffe when using GPU?

In my recent research, TensorFlow proved faster than Caffe (a framework that was always the fastest in older papers) when using GPUs and cuDNN >= 7, by using batched SGEMM: multiple tiny matrix multiplications executed in parallel.
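
To make "batched SGEMM" concrete, here is a minimal sketch (the shapes are arbitrary): tf.matmul treats leading dimensions as a batch, so many tiny products are dispatched as one op.

import tensorflow as tf

# 1024 independent 4x4 matrix products executed as a single batched
# operation rather than 1024 separate tiny kernels.
a = tf.random.normal((1024, 4, 4))
b = tf.random.normal((1024, 4, 4))
c = tf.matmul(a, b)  # shape (1024, 4, 4)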

Is TensorFlow product-oriented or service-oriented?

As TensorFlow has been used by Google for so long, it is very easy to deploy algorithms with it, so you can think of it as more product-oriented. Logically, you want to be able to deploy the algorithms you create (you can check out TensorFlow Serving [2] for more on that).




1 Answer

Why is TensorFlow slow

TensorFlow has two execution modes: eager execution and graph mode. Since version 2, TensorFlow defaults to eager execution. Eager execution is great because it lets you write code close to standard Python: it's easier to write and easier to debug. Unfortunately, it's really not as fast as graph mode.

So the idea is, once the function has been prototyped in eager mode, to make TensorFlow execute it in graph mode. For that you can use tf.function, which compiles a callable into a TensorFlow graph. Once the function is compiled into a graph, the performance gain is usually substantial. The recommended approach when developing in TensorFlow is the following:

  • Debug in eager mode, then decorate with @tf.function.
  • Don't rely on Python side effects like object mutation or list appends.
  • tf.function works best with TensorFlow ops; NumPy and Python calls are converted to constants.

I would add: think about the critical parts of your program, and which ones should be converted to graph mode first. It's usually the parts where you call a model to get a result, and that's where you will see the biggest improvements.
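
You can measure the difference yourself. A minimal sketch (the model and input below are placeholders, not the network from the question): time an eager call against the same callable wrapped in tf.function.

import timeit
import tensorflow as tf

# A throwaway model, just to illustrate the pattern.
model = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                             tf.keras.layers.Dense(4)])
x = tf.random.normal((128, 64))

graph_call = tf.function(model)  # compiled into a graph on first call
graph_call(x)                    # trace once, outside the timing

print("eager:", timeit.timeit(lambda: model(x), number=100))
print("graph:", timeit.timeit(lambda: graph_call(x), number=100))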

You can find more information in the following guides:

  • Better performance with tf.function
  • Introduction to graphs and tf.function

Applying tf.function to your code

So, there are at least two things you can change in your code to make it run considerably faster:

  1. The first is to not use model.predict on a small amount of data. The function is made to work on a huge dataset or on a generator (see this comment on GitHub). Instead, you should call the model directly, and for a further performance boost, you can wrap the call to the model in a tf.function.

Model.predict is a top-level API designed for batch-predicting outside of any loops, with the full features of the Keras APIs.

  2. The second is to make your training step a separate function, and to decorate that function with @tf.function.

So, I would declare the following things before your training loop:

# to call instead of model.predict
model_func = tf.function(model)

def get_train_func(model, model2, loss_fn, optimizer):
    """Wrapper that creates a train step using the two model passed"""
    @tf.function
    def train_func(state1_batch, state2_batch, done_batch, reward_batch, action_batch):
        Q2 = model2(state2_batch) #B
        with tf.GradientTape() as tape:
            Q1 = model(state1_batch)
            Y = reward_batch + gamma * ((1-done_batch) * tf.math.reduce_max(Q2, axis=1))
            # gather is more efficient than a list comprehension, and needed in a tf.function
            X = tf.gather(Q1, action_batch, batch_dims=1)
            loss = loss_fn(X, Y)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss
    return train_func

# train step is a callable 
train_step = get_train_func(model, model2, loss_fn, optimizer)

And you can use that function in your training loop:

if len(replay) > batch_size:
    minibatch = random.sample(replay, batch_size)
    state1_batch = np.array([s1 for (s1,a,r,s2,d) in minibatch]).reshape((batch_size, 64))
    action_batch = np.array([a for (s1,a,r,s2,d) in minibatch])   # TODO: possible differences
    reward_batch = np.float32([r for (s1,a,r,s2,d) in minibatch])
    state2_batch = np.array([s2 for (s1,a,r,s2,d) in minibatch]).reshape((batch_size, 64))
    done_batch = np.array([d for (s1,a,r,s2,d) in minibatch]).astype(np.float32)

    loss = train_step(state1_batch, state2_batch, done_batch, reward_batch, action_batch)
    losses.append(loss)
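
One small caveat: the loss returned by train_step here is a tf.Tensor. If you only need the scalar for plotting, appending float(loss) (or loss.numpy()) keeps losses as a list of plain Python numbers rather than tensors.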

There are other changes you could make to render your code more TensorFlow-idiomatic, but with those modifications, your code takes ~2 minutes on my CPU (with a 97% win rate).

Answered Oct 19 '22 by Lescurel