Resume Training tf.keras Tensorboard

I ran into a problem when I resumed training my model and viewed the progress in TensorBoard.

[Image: TensorBoard training visualization]

My question is: how do I resume training from the same step without specifying the epoch manually? Ideally, simply loading the saved model would read the global_step from the saved optimizer and continue training from there.

I have provided some code below that reproduces the problem.

import tensorflow as tf
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.models import load_model

mnist = tf.keras.datasets.mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# First run: epochs 0 to 9.
model.fit(x_train, y_train, epochs=10, callbacks=[TensorBoard()])
model.save('./final_model.h5', include_optimizer=True)

del model

# Resumed run: epoch numbering starts again at 0.
model = load_model('./final_model.h5')
model.fit(x_train, y_train, epochs=10, callbacks=[TensorBoard()])

You can start TensorBoard with the command:

tensorboard --logdir ./logs

asked Mar 06 '19 09:03

Hardian Lawi

3 Answers

You can set the initial_epoch parameter of model.fit() to the index of the epoch at which training should start. Note that epochs is then the index of the final epoch, not the number of additional epochs to run: training continues until the epoch of index epochs is reached. Epoch indices start at 0, so the first run of 10 epochs covered indices 0 to 9. To train for 10 more epochs in your example, it should be:

model.fit(x_train, y_train, initial_epoch=10, epochs=20, callbacks=[TensorBoard()])

This lets TensorBoard plot the resumed run as a continuation of the first one. More detail on these parameters can be found in the Keras documentation for model.fit().
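If you would rather not hard-code the epoch, one option is to derive it from the optimizer's step counter. Below is a minimal sketch of the arithmetic only. The helper name resume_fit_args is made up for illustration, and it assumes the step count can be read from the loaded model (in TF 2.x something like int(model.optimizer.iterations.numpy())) and that the saved optimizer state restores that counter.

```python
import math


def resume_fit_args(iterations, num_samples, batch_size, extra_epochs):
    """Return the initial_epoch/epochs arguments for a resumed model.fit().

    iterations is the number of optimizer steps already taken. Reading it
    from model.optimizer.iterations is an assumption about TF 2.x, not
    something the original question confirms.
    """
    # One epoch is one pass over the data, i.e. ceil(samples / batch) steps.
    steps_per_epoch = math.ceil(num_samples / batch_size)
    # Whole epochs already completed; this is the index of the next epoch.
    initial_epoch = iterations // steps_per_epoch
    # epochs is the index at which training stops, not a count of epochs.
    return {"initial_epoch": initial_epoch,
            "epochs": initial_epoch + extra_epochs}


# MNIST training set: 60000 samples, default batch size 32 -> 1875 steps/epoch.
# 18750 steps therefore correspond to 10 completed epochs.
args = resume_fit_args(iterations=18750, num_samples=60000,
                       batch_size=32, extra_epochs=10)
print(args)
```

The result could then be passed on as model.fit(x_train, y_train, callbacks=[TensorBoard()], **args).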

answered Oct 25 '22 13:10

melaanya

It's very simple. Create checkpoints while training the model and then use those checkpoints to resume training from where you left off.

import tensorflow as tf
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model

mnist = tf.keras.datasets.mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=10, callbacks=[TensorBoard()])
model.save('./final_model.h5', include_optimizer=True)

model = load_model('./final_model.h5')

callbacks = list()

tensorboard = TensorBoard()
callbacks.append(tensorboard)

file_path = "model-{epoch:02d}-{loss:.4f}.hdf5"

# Create checkpoints during training and save them as often as you need.
# save_freq='epoch' saves after every epoch (older tf.keras versions used the
# `period` argument, the number of epochs between saves, for this).
# save_weights_only must be False here so the whole model, including the
# optimizer state, is saved.
checkpoints = ModelCheckpoint(file_path, monitor='loss', verbose=1, save_freq='epoch', save_weights_only=False)
callbacks.append(checkpoints)

model.fit(x_train, y_train, epochs=10, callbacks=callbacks)

After this, just load the checkpoint from which you want to resume training:

model = load_model(checkpoint_of_choice)
model.fit(x_train, y_train, epochs=10, callbacks=callbacks)
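Note that fit() still numbers epochs from 0 after loading a checkpoint unless initial_epoch is passed. Since the checkpoint name above already contains the epoch, it can be parsed back out. The following is only a sketch: the helper epoch_from_checkpoint and the regular expression are made up for illustration, the loss value is invented, and it assumes ModelCheckpoint fills {epoch} with the 1-based epoch number, which is how current tf.keras versions behave as far as I know.

```python
import re

# Same file name template as in the answer above.
FILE_PATH = "model-{epoch:02d}-{loss:.4f}.hdf5"

# Hypothetical pattern matching names produced by that template.
CHECKPOINT_RE = re.compile(r"model-(\d+)-(\d+\.\d+)\.hdf5$")


def epoch_from_checkpoint(path):
    """Return the epoch number embedded in a checkpoint file name."""
    match = CHECKPOINT_RE.search(path)
    if match is None:
        raise ValueError(f"not a checkpoint name: {path!r}")
    return int(match.group(1))


# Assumption: the checkpoint written after the 10th epoch (index 9) gets
# epoch=10 in its name. The loss value here is made up.
example_name = FILE_PATH.format(epoch=10, loss=0.0153)
initial_epoch = epoch_from_checkpoint(example_name)
print(example_name, initial_epoch)
```

With that value, the resumed call could look like model.fit(x_train, y_train, initial_epoch=initial_epoch, epochs=initial_epoch + 10, callbacks=callbacks).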

And you are done.

Let me know if you have more questions about this.

answered Oct 25 '22 11:10

Abhinav Anand

Here is sample code in case someone needs it. It implements the idea proposed by Abhinav Anand:

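from glob import glob
from os.path import join

from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard
from tensorflow.keras.models import load_model

# fold_dir is the directory that holds the checkpoints and logs; nn.model()
# builds a fresh model. Both must be defined for your own setup.
mca = ModelCheckpoint(join(fold_dir, 'model_{epoch:03d}.h5'),
                      monitor='loss',
                      save_best_only=False)
tb = TensorBoard(log_dir=join(fold_dir, 'logs'),
                 write_graph=True,
                 write_images=True)
# Checkpoints are named by epoch, so the last one in sorted order is the newest.
files = sorted(glob(join(fold_dir, 'model_???.h5')))
if files:
    model_file = files[-1]
    # The three digits before '.h5' are the number of epochs already completed.
    initial_epoch = int(model_file[-6:-3])
    print('Resuming using saved model %s.' % model_file)
    model = load_model(model_file)
else:
    model = nn.model()
    initial_epoch = 0
model.fit(x_train,
          y_train,
          epochs=100,
          initial_epoch=initial_epoch,
          callbacks=[mca, tb])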

Replace nn.model() with your own function for defining the model.
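One caveat: the 'model_???.h5' glob and the [-6:-3] slice only work while the epoch number has exactly three digits. A possible variant of the lookup that parses the number with a regular expression is sketched below. The function name find_latest_checkpoint is made up for illustration, and the demo files are empty placeholders, not real models.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical pattern for the model_<epoch>.h5 names used above.
MODEL_RE = re.compile(r"model_(\d+)\.h5")


def find_latest_checkpoint(directory):
    """Return (path, epoch) of the highest-numbered checkpoint, or (None, 0)."""
    candidates = []
    for path in Path(directory).glob("model_*.h5"):
        match = MODEL_RE.fullmatch(path.name)
        if match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None, 0
    # Compare by the parsed epoch number, not by string order.
    epoch, path = max(candidates, key=lambda item: item[0])
    return path, epoch


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("model_009.h5", "model_010.h5", "model_1000.h5", "notes.txt"):
        (Path(tmp) / name).touch()
    latest, initial_epoch = find_latest_checkpoint(tmp)
    latest_name = latest.name

print(latest_name, initial_epoch)
```

If a path comes back, it can be handed to load_model() and the epoch to initial_epoch, as in the code above; otherwise build a new model and start from 0.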

answered Oct 25 '22 11:10

Björn Lindqvist

Tags: python, machine-learning, tensorflow, keras, tensorboard