After a lot of research, it seems there is no good way to properly stop and resume training with a TensorFlow 2 / Keras model. This is true whether you use model.fit() or a custom training loop.
There seem to be two supported ways to save a model while training (a minimal sketch of both follows below):

1. Save just the weights of the model, using model.save_weights() or save_weights_only=True with tf.keras.callbacks.ModelCheckpoint. This seems to be preferred by most of the examples I've seen, however it has a number of major issues.

2. Save the entire model, optimizer, etc. using model.save() or save_weights_only=False. The optimizer state is saved (good), but other issues remain.
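For illustration, here is a minimal sketch of both checkpointing variants; the toy model, data, and the ckpts/ paths are made-up placeholders:

import os
import numpy as np
import tensorflow as tf

# Toy model and data, just to make the snippet runnable
model = tf.keras.Sequential(
    [tf.keras.layers.Dense(10, activation="softmax", input_shape=(16,))])
model.compile("adam", "sparse_categorical_crossentropy")
x = np.random.rand(256, 16)
y = np.random.randint(0, 10, size=256)
os.makedirs("ckpts", exist_ok=True)

# Variant 1: weights only -- the optimizer state is NOT in the checkpoint
weights_ckpt = tf.keras.callbacks.ModelCheckpoint(
    "ckpts/weights.{epoch:02d}.h5", save_weights_only=True)

# Variant 2: full model -- architecture, weights and optimizer state
full_ckpt = tf.keras.callbacks.ModelCheckpoint(
    "ckpts/model.{epoch:02d}.h5", save_weights_only=False)

model.fit(x, y, epochs=3, callbacks=[weights_ckpt, full_ckpt])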
The best workaround I've found is to use a custom training loop and manually save the step. This fixes the TensorBoard logging, and the learning rate schedule can be fixed by doing something like keras.backend.set_value(model.optimizer.iterations, step) (roughly as sketched below). However, since a full model save is off the table, the optimizer state is not preserved. I can see no way to save the state of the optimizer independently, at least not without a lot of work, and messing with the LR schedule as I've done feels messy as well.
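A rough sketch of that workaround, with a made-up toy model, LR schedule, and file names; note it restores the step counter but not the optimizer's slot variables (e.g. Adam moments):

import json
import os
import tensorflow as tf
from tensorflow import keras

# Toy model with an LR schedule, so there is a step counter worth restoring
model = keras.Sequential([keras.layers.Dense(1, input_shape=(4,))])
lr = keras.optimizers.schedules.ExponentialDecay(1e-3, decay_steps=1000, decay_rate=0.9)
model.compile(keras.optimizers.Adam(lr), "mse")

os.makedirs("resume", exist_ok=True)

# --- on save ---
model.save_weights("resume/weights.h5")
with open("resume/step.json", "w") as f:
    json.dump({"step": int(model.optimizer.iterations.numpy())}, f)

# --- on resume ---
model.load_weights("resume/weights.h5")
with open("resume/step.json") as f:
    step = json.load(f)["step"]
# Put the step counter back so the LR schedule continues where it left off;
# the optimizer's slot variables are still lost, which is the remaining problem.
keras.backend.set_value(model.optimizer.iterations, step)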
Am I missing something? How are people out there saving/resuming using this API?
You can use the model.stop_training attribute to stop the training.
The EarlyStopping callback from keras.callbacks stops training when a monitored quantity has stopped improving; its patience argument is the number of epochs with no improvement after which training will be stopped.
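A minimal sketch of how EarlyStopping is typically used; the toy model, data, and the monitor/patience values are placeholders:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(10, activation="softmax", input_shape=(16,))])
model.compile("adam", "sparse_categorical_crossentropy")
x = np.random.rand(256, 16)
y = np.random.randint(0, 10, size=256)

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",            # quantity to watch
    patience=5,                    # epochs with no improvement before stopping
    restore_best_weights=True)     # roll back to the best epoch's weights

model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])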
According to the Keras documentation, a model saved with model.save(filepath) contains: the architecture of the model, allowing it to be re-created; the training configuration (loss, optimizer); and the state of the optimizer, allowing you to resume training exactly where you left off.
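A minimal sketch of saving the full model and resuming from it; the toy model, data, and file name are made up for illustration:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(10, activation="softmax", input_shape=(16,))])
model.compile("adam", "sparse_categorical_crossentropy")
x = np.random.rand(256, 16)
y = np.random.randint(0, 10, size=256)

model.fit(x, y, epochs=3)
model.save("checkpoint.h5")   # architecture + training config + optimizer state

# Later, possibly in a fresh process:
model = tf.keras.models.load_model("checkpoint.h5")
model.fit(x, y, epochs=3)     # continues training with the restored optimizer state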
Using a start/stop/resume training approach with Keras, we achieved 94.14% validation accuracy. At this point the learning rate has become so small that the corresponding weight updates are also very small, implying that the model cannot learn much more. When resuming training for phase 3 (Figure 5), I only let the network train for 5 epochs before killing the script, because no significant learning progress was being made.
You're right, there isn't built-in support for resumability - which is exactly what motivated me to create DeepTrain. It's like Pytorch Lightning (better and worse in different regards) for TensorFlow/Keras.
Why another library? Don't we have enough? You have nothing like this; if there was, I'd not build it. DeepTrain is tailored for the "babysitting approach" to training: train fewer models, but train them thoroughly. Closely monitor each stage to diagnose what's wrong and how to fix it.
Inspiration came from my own use; I'd see "validation spikes" throughout a long epoch, and couldn't afford to pause as it'd restart the epoch or otherwise disrupt the train loop. And forget knowing which batch you were fitting, or how many remain.
How's it compare to Pytorch Lightning? Superior resumability and introspection, along with unique training debug utilities - but Lightning fares better in other regards. I have a comprehensive comparison in the works and will post it within a week.
Pytorch support coming? Maybe. If I can convince the Lightning dev team to make up for its shortcomings relative to DeepTrain, then no - otherwise probably. In the meantime, you can explore the gallery of Examples.
Minimal example:
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from deeptrain import TrainGenerator, DataGenerator
ipt = Input((16,))
out = Dense(10, 'softmax')(ipt)
model = Model(ipt, out)
model.compile('adam', 'categorical_crossentropy')
dg = DataGenerator(data_path="data/train", labels_path="data/train/labels.npy")
vdg = DataGenerator(data_path="data/val", labels_path="data/val/labels.npy")
tg = TrainGenerator(model, dg, vdg, epochs=3, logs_dir="logs/")
tg.train()
You can KeyboardInterrupt at any time, inspect the model, train state, and data generator - and resume.
tf.keras.callbacks.experimental.BackupAndRestore, an API for resuming training from interruptions, was added in tensorflow>=2.3. It works great in my experience.
Reference: https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/experimental/BackupAndRestore
tf.keras.callbacks.BackupAndRestore can take care of this (in older TensorFlow releases it lives under tf.keras.callbacks.experimental.BackupAndRestore). Just use the callback as:
callback = tf.keras.callbacks.experimental.BackupAndRestore(
    backup_dir="backup_directory")
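For context, a minimal sketch of how this callback is typically passed to model.fit; the toy model, data, and backup directory name are made up. If the run is interrupted and the script is rerun, training resumes from the last backed-up epoch:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(10, activation="softmax", input_shape=(16,))])
model.compile("adam", "sparse_categorical_crossentropy")
x = np.random.rand(256, 16)
y = np.random.randint(0, 10, size=256)

backup = tf.keras.callbacks.experimental.BackupAndRestore(backup_dir="training_backup")

# If this run is killed and the script is executed again, fit() picks up
# from the last completed epoch recorded under training_backup/.
model.fit(x, y, epochs=20, callbacks=[backup])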