 

Keras model training memory leak

I'm new to Keras, TensorFlow and Python, and I'm trying to build a model for personal use/future learning. I've just started with Python and came up with this code (with the help of videos and tutorials). My problem is that Python's memory usage slowly creeps up with each epoch, and even after a new model is constructed. Once memory hits 100%, training just stops with no error or warning. I don't know much yet, but the issue should be somewhere within the loop (if I'm not mistaken). I know about

K.clear_session()

but either it doesn't remove the issue or I don't know how to integrate it into my code. I have Python 3.6.4, TensorFlow 2.0.0rc1 (CPU version) and Keras 2.3.0.
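
That is, the backend call spelled out (with the backend import for the tf.keras that ships with TensorFlow 2.0):

from tensorflow.keras import backend as K
K.clear_session()  # resets the graph state that Keras builds models into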

This is my code:

import pandas as pd
import os
import time
import tensorflow as tf
import numpy as np
import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, BatchNormalization
from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint

EPOCHS = 25
BATCH_SIZE = 32           

df = pd.read_csv("EntryData.csv", names=['1SH5', '1SHA', '1SA5', '1SAA', '1WH5', '1WHA',
                                         '2SA5', '2SAA', '2SH5', '2SHA', '2WA5', '2WAA',
                                         '3R1', '3R2', '3R3', '3R4', '3R5', '3R6',
                                         'Target'])

df_val = 14554 

validation_df = df[df.index > df_val]
df = df[df.index <= df_val]

train_x = df.drop(columns=['Target'])
train_y = df[['Target']]
validation_x = validation_df.drop(columns=['Target'])
validation_y = validation_df[['Target']]

train_x = np.asarray(train_x)
train_y = np.asarray(train_y)
validation_x = np.asarray(validation_x)
validation_y = np.asarray(validation_y)

train_x = train_x.reshape(train_x.shape[0], 1, train_x.shape[1])
validation_x = validation_x.reshape(validation_x.shape[0], 1, validation_x.shape[1])

dense_layers = [0, 1, 2]
layer_sizes = [32, 64, 128]
conv_layers = [1, 2, 3]

for dense_layer in dense_layers:
    for layer_size in layer_sizes:
        for conv_layer in conv_layers:
            NAME = "{}-conv-{}-nodes-{}-dense-{}".format(conv_layer, layer_size, 
                    dense_layer, int(time.time()))
            tensorboard = TensorBoard(log_dir="logs\{}".format(NAME))
            print(NAME)

            model = Sequential()
            model.add(LSTM(layer_size, input_shape=(train_x.shape[1:]), 
                                       return_sequences=True))
            model.add(Dropout(0.2))
            model.add(BatchNormalization())

            for l in range(conv_layer-1):
                model.add(LSTM(layer_size, return_sequences=True))
                model.add(Dropout(0.1))
                model.add(BatchNormalization())

            for l in range(dense_layer):
                model.add(Dense(layer_size, activation='relu'))
                model.add(Dropout(0.2))

            model.add(Dense(2, activation='softmax'))

            opt = tf.keras.optimizers.Adam(lr=0.001, decay=1e-6)

            # Compile model
            model.compile(loss='sparse_categorical_crossentropy',
                          optimizer=opt,
                          metrics=['accuracy'])

            # unique file name that will include the epoch 
            # and the validation acc for that epoch
            filepath = "RNN_Final.{epoch:02d}-{val_accuracy:.3f}"  
            checkpoint = ModelCheckpoint("models\{}.model".format(filepath, 
                         monitor='val_acc', verbose=0, save_best_only=True, 
                         mode='max')) # saves only the best ones

            # Train model
            history = model.fit(
                train_x, train_y,
                batch_size=BATCH_SIZE,
                epochs=EPOCHS,
                validation_data=(validation_x, validation_y),
                callbacks=[tensorboard, checkpoint])

# Score model
score = model.evaluate(validation_x, validation_y, verbose=2)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
# Save model
model.save("models\{}".format(NAME))

Also, I don't know if it's possible to ask two problems within one question (I don't want to spam the site with problems anyone with a bit of Python experience could resolve within a minute), but I also have a problem with checkpoint saving. I want to save only the best-performing model (one model per NN specification - number of nodes/layers), but currently a checkpoint is saved after every epoch. If this is inappropriate to ask here, I can create another question for it.
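
While re-reading, I suspect the closing parenthesis of .format() is misplaced, so that monitor, save_best_only and mode are passed to format() (which silently ignores unused keyword arguments) rather than to ModelCheckpoint, which would explain the save-every-epoch behaviour. If so, the intended call would look something like this (untested; note that Keras 2.3 logs the metric as val_accuracy, matching my filepath template):

checkpoint = ModelCheckpoint(
    "models\{}.model".format(filepath),  # format() closes here
    monitor='val_accuracy',  # the metric name Keras 2.3 / TF 2.0 actually log
    verbose=0,
    save_best_only=True,     # only write when the monitored metric improves
    mode='max')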

Thank you very much for any help.

Asked Sep 27 '19 by Sly Shark


1 Answer

One source of the problem: a new model = Sequential() in each loop iteration does not remove the previous model; it remains built within its TensorFlow graph scope, and every new model = Sequential() adds another lingering construct until memory eventually overflows. To ensure a model is destroyed in full, run the following once you're done with it:

import gc
import tensorflow as tf
from tensorflow.keras import backend as K

del model                           # drop the Python reference to the model
gc.collect()                        # collect whatever `del` left behind
K.clear_session()                   # clear the Keras-side graph state
tf.compat.v1.reset_default_graph()  # the TF graph isn't the same as the Keras graph

gc is Python's garbage-collection module, which clears the remnant traces of the model after del. K.clear_session() is the main call; it clears the TensorFlow graph.
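
Concretely, that means running the cleanup at the end of every iteration of your nested loop, before the next model = Sequential() is built. A minimal sketch (the build/fit body is elided; everything else follows your variable names):

import gc
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.models import Sequential

for dense_layer in dense_layers:
    for layer_size in layer_sizes:
        for conv_layer in conv_layers:
            model = Sequential()
            # ... add layers, compile, fit and checkpoint as in your code ...

            # evaluate/save here, while the model still exists, then clean up:
            del model                           # drop the Python reference
            gc.collect()                        # collect what `del` left behind
            K.clear_session()                   # clear the Keras graph state
            tf.compat.v1.reset_default_graph()  # reset the TF graph as well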

Also, while your idea for model checkpointing, logging, and hyperparameter search is sound, the execution is quite faulty; as set up, you will actually be testing only one hyperparameter combination for the entire nested loop. But that should be asked in a separate question.


UPDATE: I just encountered the same problem on a fully, properly set-up environment; the likeliest conclusion is that it's a bug, and a definite culprit is eager execution. To work around it, use

tf.compat.v1.disable_eager_execution() # right after `import tensorflow as tf`

to switch to Graph mode, which can also run significantly faster. Also see the updated cleanup code above.
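
To confirm the switch took effect, you can check tf.executing_eagerly() before building anything (a quick sanity check; this function exists in TF 2.x):

import tensorflow as tf
tf.compat.v1.disable_eager_execution()  # must run before any other TF calls
print(tf.executing_eagerly())           # prints False once Graph mode is active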

Answered Oct 17 '22 by OverLordGoldDragon