I'm working on training an LSTM network on Google Cloud Machine Learning Engine using Keras with the TensorFlow backend. After some adjustments to the gcloud setup and my Python script, I managed to deploy my model and run a successful training job.
I then tried to make my model save checkpoints after every epoch using the Keras ModelCheckpoint callback. Running a local training job with Google Cloud works perfectly as expected: the weights are stored at the specified path after each epoch. But when I try to run the same job online on Google Cloud Machine Learning Engine, the weights.hdf5 does not get written to my Google Cloud Bucket. Instead I get the following error:
...
File "h5f.pyx", line 71, in h5py.h5f.open (h5py/h5f.c:1797)
IOError: Unable to open file (Unable to open file: name =
'gs://.../weights.hdf5', errno = 2, error message = 'no such file or
directory', flags = 0, o_flags = 0)
I investigated this issue and it turned out that there is no problem with the Bucket itself, as the Keras TensorBoard callback works fine and writes the expected output to the same bucket. I also made sure that h5py gets included by providing it in the setup.py located at:
├── setup.py
└── trainer
    ├── __init__.py
    ├── ...
The actual include in setup.py is shown below:
# setup.py
from setuptools import setup, find_packages

setup(name='kerasLSTM',
      version='0.1',
      packages=find_packages(),
      author='Kevin Katzke',
      install_requires=['keras', 'h5py', 'simplejson'],
      zip_safe=False)
I guess the problem comes down to the fact that GCS cannot be accessed with Python's built-in open() for I/O; instead, TensorFlow provides a custom implementation:
import tensorflow as tf
from tensorflow.python.lib.io import file_io

# The file must be opened in write mode for f.write() to work
with file_io.FileIO("gs://...", 'w') as f:
    f.write("Hi!")
I checked how the Keras ModelCheckpoint callback implements the actual file writing, and it turned out that it uses h5py.File() for I/O:
with h5py.File(filepath, mode='w') as f:
    f.attrs['keras_version'] = str(keras_version).encode('utf8')
    f.attrs['backend'] = K.backend().encode('utf8')
    f.attrs['model_config'] = json.dumps({
        'class_name': model.__class__.__name__,
        'config': model.get_config()
    }, default=get_json_type).encode('utf8')
And as the h5py package is a Pythonic interface to the HDF5 binary data format, h5py.File() seems to call into the underlying compiled HDF5 library (written in C, with a Fortran interface) as far as I can tell: source, documentation.
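To illustrate the mismatch, a minimal sketch (the bucket path is a placeholder, not my real bucket): file_io.FileIO understands gs:// URLs, while h5py.File() hands the raw string to the compiled HDF5 library, which looks for it on the local filesystem and fails with the errno 2 seen above.
import h5py
from tensorflow.python.lib.io import file_io

# file_io knows how to talk to GCS, so this works (provided the job can access the bucket):
with file_io.FileIO('gs://my-bucket/test.txt', mode='w') as f:
    f.write('hello')

# h5py.File() passes the string straight to the HDF5 C library, which treats it
# as a local path, so opening a gs:// URL raises the IOError / errno 2 from above:
try:
    h5py.File('gs://my-bucket/weights.hdf5', mode='r')
except IOError as e:
    print(e)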
How can I fix this and make the ModelCheckpoint callback write to my GCS Bucket? Is there a way to "monkey patch" how the hdf5 file is opened so that it uses GCS's file_io.FileIO()?
I might be a bit late on this, but for the sake of future visitors I will describe the whole process of adapting code that previously ran locally to be GoogleML-aware from the I/O point of view.
Python's standard open(file_name, mode) does not work with buckets (gs://...../file_name). One needs to from tensorflow.python.lib.io import file_io and change all calls from open(file_name, mode) to file_io.FileIO(file_name, mode=mode) (note the named mode parameter). The interface of the opened handle is the same.
Keras and other libraries mostly use the standard open(file_name, mode) internally, so calls into third-party code such as trained_model.save(file_path) will fail to store the result to the bucket. The only way to retrieve a model after the job has finished successfully is to save it locally and then move it to the bucket. The code below is quite inefficient, because it loads the whole model at once and then dumps it to the bucket, but it worked for me for relatively small models:
# Save the model locally, then copy it into the bucket (model_dir is the gs:// path)
model.save(file_path)

with file_io.FileIO(file_path, mode='rb') as input_f:
    with file_io.FileIO(os.path.join(model_dir, file_path), mode='wb+') as output_f:
        output_f.write(input_f.read())
The mode must be set to binary for both reading and writing.
When the file is relatively big, it makes sense to read and write it in chunks to decrease memory consumption.
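As an illustration of that, here is one possible chunked copy on top of file_io.FileIO; the helper name copy_in_chunks and the 16 MB chunk size are my own choices rather than part of the original answer:
import os

from tensorflow.python.lib.io import file_io


def copy_in_chunks(src_path, dst_path, chunk_size=16 * 1024 * 1024):
    """Copy a file (e.g. a local .h5 checkpoint) to a gs:// path in fixed-size chunks."""
    with file_io.FileIO(src_path, mode='rb') as src:
        with file_io.FileIO(dst_path, mode='wb+') as dst:
            while True:
                chunk = src.read(chunk_size)
                if not chunk:
                    break
                dst.write(chunk)


# Example usage (paths are placeholders):
# copy_in_chunks('model.h5', 'gs://my-bucket/model.h5')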
The following implementation, temporarily put in place of the real train_model call, should serve as a sanity check:
import argparse
import os

from tensorflow.python.lib.io import file_io

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--job-dir',
        help='GCS location with read/write access',
        required=True
    )
    args = parser.parse_args()
    arguments = args.__dict__
    job_dir = arguments.pop('job_dir')

    # Write a small test file into the bucket to verify that file_io can reach it
    with file_io.FileIO(os.path.join(job_dir, "test.txt"), mode='wb+') as of:
        of.write("Test passed.")
After a successful execution you should see the file test.txt with the content "Test passed." in your bucket.
The issue can be solved with the following piece of code:
# Save Keras ModelCheckpoints locally
model.save('model.h5')

# Copy model.h5 over to Google Cloud Storage (the destination is a path in the bucket)
with file_io.FileIO('model.h5', mode='rb') as input_f:
    with file_io.FileIO('gs://.../model.h5', mode='wb+') as output_f:
        output_f.write(input_f.read())
        print("Saved model.h5 to GCS")
The model.h5 is saved on the local filesystem and then copied over to GCS. As Jochen pointed out, there is currently no easy support for writing HDF5 model checkpoints directly to GCS. With this hack it is possible to write the data until an easier solution is provided.
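Since the original question was about saving a checkpoint after every epoch, the same copy trick can be wrapped in a small custom Keras callback. The sketch below is only an illustration of that idea, not part of the accepted answer; the class name GCSCheckpoint and the gs:// destination are placeholders:
import os

from keras.callbacks import Callback
from tensorflow.python.lib.io import file_io


class GCSCheckpoint(Callback):
    """Saves the model locally after each epoch, then copies the file to a gs:// path."""

    def __init__(self, local_path, gcs_dir):
        super(GCSCheckpoint, self).__init__()
        self.local_path = local_path
        self.gcs_dir = gcs_dir

    def on_epoch_end(self, epoch, logs=None):
        # Write the HDF5 file locally first, since h5py cannot open gs:// paths
        self.model.save(self.local_path)
        # Then mirror it into the bucket with file_io, which does understand gs:// paths
        gcs_path = os.path.join(self.gcs_dir, 'weights.{:02d}.hdf5'.format(epoch))
        with file_io.FileIO(self.local_path, mode='rb') as input_f:
            with file_io.FileIO(gcs_path, mode='wb+') as output_f:
                output_f.write(input_f.read())


# Usage (placeholder bucket path):
# model.fit(x, y, epochs=10, callbacks=[GCSCheckpoint('weights.hdf5', 'gs://.../checkpoints')])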