
Accuracy no longer improving after switching to Dataset

I recently trained a binary image classifier and ended up with a model that was around 97.8% accurate. I created this classifier by following a couple of official TensorFlow guides, namely:

  • https://www.tensorflow.org/tutorials/images/classification
  • https://www.tensorflow.org/tutorials/load_data/images

I noticed while training (on a GTX 1080) that each epoch was taking around 30 seconds to run. Further reading suggested that a better way to feed data into a TensorFlow training run is to use a tf.data.Dataset, so I updated my code to load the images into a dataset and pass it to the model.fit_generator method.

Now when I perform my training I find that my accuracy and loss metrics are static - even with the learning rate changing automatically over time. The output looks something like this:

loss: 7.7125 - acc: 0.5000 - val_loss: 7.7125 - val_acc: 0.5000

Given that I'm training a binary classifier, an accuracy of 50% is the same as guessing, so I'm wondering if there's a problem with the way I'm providing the images, or perhaps with the size of the dataset.

My image data is split like this:

training/
        true/  (366 images)
        false/ (354 images)

validation/
        true/  (175 images)
        false/ (885 images)

I was previously using ImageDataGenerator with various augmentations applied, which effectively increased the size of the training set. Is my problem the size of my dataset?
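For context, the earlier ImageDataGenerator pipeline was along these lines (a rough sketch only; the augmentation parameters shown here are illustrative, not my exact settings):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

import settings

# Illustrative augmentation settings; the real values may differ.
train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True
)

train_generator = train_datagen.flow_from_directory(
    settings.TRAINING_DIRECTORY,
    target_size=(settings.TARGET_IMAGE_HEIGHT, settings.TARGET_IMAGE_WIDTH),
    batch_size=settings.TRAINING_BATCH_SIZE,
    class_mode='binary'
)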

The application code I'm using is as follows:

import math

import tensorflow as tf
import os

from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import EarlyStopping

import helpers
import settings

AUTOTUNE = tf.data.experimental.AUTOTUNE

assert tf.test.is_built_with_cuda()
assert tf.test.is_gpu_available()

# Collect the list of training files and process their paths.
training_dataset_files = tf.data.Dataset.list_files(os.path.join(settings.TRAINING_DIRECTORY, '*', '*.png'))
training_dataset_labelled = training_dataset_files.map(helpers.process_path, num_parallel_calls=AUTOTUNE)
training_dataset = helpers.prepare_for_training(training_dataset_labelled)

# Collect the validation files.
validation_dataset_files = tf.data.Dataset.list_files(os.path.join(settings.VALIDATION_DIRECTORY, '*', '*.png'))
validation_dataset_labelled = validation_dataset_files.map(helpers.process_path, num_parallel_calls=AUTOTUNE)
validation_dataset = helpers.prepare_for_training(validation_dataset_labelled)

model = tf.keras.models.Sequential([
    # This is the first convolution
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu', input_shape=(settings.TARGET_IMAGE_HEIGHT, settings.TARGET_IMAGE_WIDTH, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The second convolution
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The third convolution
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The fourth convolution
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The fifth convolution
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # Flatten the results to feed into a DNN
    tf.keras.layers.Flatten(),
    # 512 neuron hidden layer
    tf.keras.layers.Dense(512, activation='relu'),
    # Only 1 output neuron. It will contain a value from 0-1 where 0 for 1 class ('false') and 1 for the other ('true')
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.summary()

model.compile(
    loss='binary_crossentropy',
    optimizer=RMSprop(lr=0.1),
    metrics=['acc']
)

callbacks = [
    # EarlyStopping(patience=4),
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor='val_acc',
        patience=2,
        verbose=1,
        factor=0.5,
        min_lr=0.00001
    ),
    tf.keras.callbacks.ModelCheckpoint(
        # Path where to save the model
        filepath=settings.CHECKPOINT_FILE,
        # The two parameters below mean that we will overwrite
        # the current checkpoint if and only if
        # the `val_loss` score has improved.
        save_best_only=True,
        monitor='val_loss',
        verbose=1
    ),
    tf.keras.callbacks.TensorBoard(
        log_dir=settings.LOG_DIRECTORY,
        histogram_freq=1
    )
]

training_dataset_length = tf.data.experimental.cardinality(training_dataset_files).numpy()
steps_per_epoch = math.ceil(training_dataset_length // settings.TRAINING_BATCH_SIZE)

validation_dataset_length = tf.data.experimental.cardinality(validation_dataset_files).numpy()
validation_steps = math.ceil(validation_dataset_length // settings.VALIDATION_BATCH_SIZE)

history = model.fit_generator(
    training_dataset,
    steps_per_epoch=steps_per_epoch,
    epochs=20000,
    verbose=1,
    validation_data=validation_dataset,
    validation_steps=validation_steps,
    callbacks=callbacks,
)

model.save(settings.FULL_MODEL_FILE)

With helpers.py looking like this:

import tensorflow as tf
import settings

AUTOTUNE = tf.data.experimental.AUTOTUNE


def process_path(file_path):
    parts = tf.strings.split(file_path, '\\')
    label = parts[-2] == settings.CLASS_NAMES

    # Read the file and decode the image
    img = tf.io.read_file(file_path)
    img = tf.image.decode_png(img, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)
    img = tf.image.resize(img, [settings.TARGET_IMAGE_HEIGHT, settings.TARGET_IMAGE_WIDTH])
    return img, label


def prepare_for_training(ds, cache=True, shuffle_buffer_size=10000):
    if cache:
        if isinstance(cache, str):
            ds = ds.cache(cache)
        else:
            ds = ds.cache()

    ds = ds.shuffle(buffer_size=shuffle_buffer_size)

    ds = ds.repeat()
    ds = ds.batch(settings.TRAINING_BATCH_SIZE)
    ds = ds.prefetch(buffer_size=AUTOTUNE)

    return ds

A larger snippet of application output is as follows:

21/22 [===========================>..] - ETA: 0s - loss: 7.7125 - acc: 0.5000
Epoch 00207: val_loss did not improve from 7.71247
22/22 [==============================] - 5s 247ms/step - loss: 7.7125 - acc: 0.5000 - val_loss: 7.7125 - val_acc: 0.5000
Epoch 208/20000
21/22 [===========================>..] - ETA: 0s - loss: 7.7125 - acc: 0.5000
Epoch 00208: val_loss did not improve from 7.71247
22/22 [==============================] - 5s 248ms/step - loss: 7.7125 - acc: 0.5000 - val_loss: 7.7125 - val_acc: 0.5000
Epoch 209/20000
21/22 [===========================>..] - ETA: 0s - loss: 7.7125 - acc: 0.5000
Epoch 00209: val_loss did not improve from 7.71247
22/22 [==============================] - 6s 251ms/step - loss: 7.7125 - acc: 0.5000 - val_loss: 7.7125 - val_acc: 0.5000
Epoch 210/20000
21/22 [===========================>..] - ETA: 0s - loss: 7.7125 - acc: 0.5000
Epoch 00210: val_loss did not improve from 7.71247
22/22 [==============================] - 5s 242ms/step - loss: 7.7125 - acc: 0.5000 - val_loss: 7.7125 - val_acc: 0.5000
Epoch 211/20000
21/22 [===========================>..] - ETA: 0s - loss: 7.7125 - acc: 0.5000
Epoch 00211: val_loss did not improve from 7.71247
22/22 [==============================] - 5s 246ms/step - loss: 7.7125 - acc: 0.5000 - val_loss: 7.7125 - val_acc: 0.5000
Epoch 212/20000
21/22 [===========================>..] - ETA: 0s - loss: 7.7125 - acc: 0.5000
Epoch 00212: val_loss did not improve from 7.71247
22/22 [==============================] - 6s 252ms/step - loss: 7.7125 - acc: 0.5000 - val_loss: 7.7125 - val_acc: 0.5000
Epoch 213/20000
21/22 [===========================>..] - ETA: 0s - loss: 7.7125 - acc: 0.5000
Epoch 00213: val_loss did not improve from 7.71247
22/22 [==============================] - 5s 242ms/step - loss: 7.7125 - acc: 0.5000 - val_loss: 7.7125 - val_acc: 0.5000
Epoch 214/20000
21/22 [===========================>..] - ETA: 0s - loss: 7.7125 - acc: 0.5000
Epoch 00214: val_loss did not improve from 7.71247
22/22 [==============================] - 5s 241ms/step - loss: 7.7125 - acc: 0.5000 - val_loss: 7.7125 - val_acc: 0.5000
Epoch 215/20000
21/22 [===========================>..] - ETA: 0s - loss: 7.7125 - acc: 0.5000
Epoch 00215: val_loss did not improve from 7.71247
22/22 [==============================] - 5s 247ms/step - loss: 7.7125 - acc: 0.5000 - val_loss: 7.7125 - val_acc: 0.5000
Epoch 216/20000
21/22 [===========================>..] - ETA: 0s - loss: 7.7125 - acc: 0.5000
Epoch 00216: val_loss did not improve from 7.71247
22/22 [==============================] - 5s 248ms/step - loss: 7.7125 - acc: 0.5000 - val_loss: 7.7125 - val_acc: 0.5000
Epoch 217/20000
21/22 [===========================>..] - ETA: 0s - loss: 7.7125 - acc: 0.5000
Epoch 00217: val_loss did not improve from 7.71247
22/22 [==============================] - 5s 249ms/step - loss: 7.7125 - acc: 0.5000 - val_loss: 7.7125 - val_acc: 0.5000
Epoch 218/20000
21/22 [===========================>..] - ETA: 0s - loss: 7.7125 - acc: 0.5000
Epoch 00218: val_loss did not improve from 7.71247
22/22 [==============================] - 5s 244ms/step - loss: 7.7125 - acc: 0.5000 - val_loss: 7.7125 - val_acc: 0.5000
Epoch 219/20000
19/22 [========================>.....] - ETA: 0s - loss: 7.7125 - acc: 0.5000
Asked Nov 06 '19 by Daniel Samuels

1 Answer

There are some things you should check.

  • Try training a few more times; you may have been unlucky with the 'relu' activations (if one layer's outputs all go to zero, you're stuck forever).
  • Take an x, y pair from the dataset and verify that y is between 0 and 1 (because you're using 'sigmoid'); see the sketch below.

These two are the most troublesome and probable things.
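A quick way to check both is to pull a single batch out of the dataset and look at it directly (a minimal sketch, assuming eager execution and that training_dataset is the batched dataset from your script):

import tensorflow as tf

# Take one batch from the (shuffled, repeated, batched) dataset and inspect it.
for x_batch, y_batch in training_dataset.take(1):
    print('x shape:', x_batch.shape)  # expect (batch, height, width, 3)
    print('x range:', float(tf.reduce_min(x_batch)), float(tf.reduce_max(x_batch)))
    print('y shape:', y_batch.shape)  # for a 1-neuron sigmoid, expect one value per sample
    print('y values:', y_batch.numpy())

If y is not a single 0/1 value per sample, or x is far outside the range the model expects, that alone can pin the loss and accuracy where you're seeing them.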

Later you might want to check whether x from the dataset is within the same range you trained with before (not crucial, but it might change the performance a little), whether the number of channels is the same, etc.


For the relus, there are solutions like this one.
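One common mitigation (not necessarily the one referred to above) is to replace the plain 'relu' activations with LeakyReLU, which keeps a small gradient for negative inputs so units can recover instead of dying. A rough sketch, with a placeholder input shape:

import tensorflow as tf

# Each Conv2D/Dense loses its activation argument and is followed by LeakyReLU.
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), input_shape=(150, 150, 3)),  # use your TARGET_IMAGE_HEIGHT/WIDTH here
    tf.keras.layers.LeakyReLU(alpha=0.1),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(32, (3, 3)),
    tf.keras.layers.LeakyReLU(alpha=0.1),
    tf.keras.layers.MaxPooling2D(2, 2),
    # ... remaining convolution blocks follow the same pattern ...
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512),
    tf.keras.layers.LeakyReLU(alpha=0.1),
    tf.keras.layers.Dense(1, activation='sigmoid')
])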

Answered Oct 11 '22 by Daniel Möller