I am currently working on a small binary classification project using the new Keras API in TensorFlow. The problem is a simplified version of the Higgs Boson challenge posted on Kaggle.com a few years back. The dataset shape is 2000x14, where the first 13 elements of each row form the input vector, and the 14th element is the corresponding label. Here is a sample of said dataset:
86.043,52.881,61.231,95.475,0.273,77.169,-0.015,1.856,32.636,202.068,2.432,-0.419,0.0,0
138.149,69.197,58.607,129.848,0.941,120.276,3.811,1.886,71.435,384.916,2.447,1.408,0.0,1
137.457,3.018,74.670,81.705,5.954,775.772,-8.854,2.625,1.942,157.231,1.193,0.873,0.824,1
I am relatively new to machine learning and TensorFlow, but I am familiar with the higher-level concepts such as loss functions, optimizers and activation functions. I have tried building various models inspired by examples of binary classification problems found online, but I am having difficulties with training the model. During training, the loss sometimes increases within the same epoch, leading to unstable learning. The accuracy hits a plateau around 70%. I have tried changing the learning rate and other hyperparameters, but to no avail. In comparison, I have hardcoded a fully-connected feed-forward neural net that reaches around 80-85% accuracy on the same problem.
Here is my current model:
import tensorflow as tf
from tensorflow.keras.layers import Dense
import numpy as np
import pandas as pd

def normalize(array):
    # Scale each row to unit L2 norm.
    return array / np.linalg.norm(array, ord=2, axis=1, keepdims=True)

# Read the dataset once, then split it 1800/200 into train and test sets.
data = pd.read_csv('data/labeled.csv', sep=r'\s+')
x_train = data.iloc[:1800, :-1].values
y_train = data.iloc[:1800, -1:].values
x_test = data.iloc[1800:, :-1].values
y_test = data.iloc[1800:, -1:].values

x_train = normalize(x_train)
x_test = normalize(x_test)

model = tf.keras.Sequential()
model.add(Dense(9, input_dim=13, activation=tf.nn.sigmoid))
model.add(Dense(6, activation=tf.nn.sigmoid))
model.add(Dense(1, activation=tf.nn.sigmoid))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=50)
model.evaluate(x_test, y_test)
As mentioned, some of the epochs start with a higher accuracy than they finish with, leading to unstable learning.
32/1800 [..............................] - ETA: 0s - loss: 0.6830 - acc: 0.5938
1152/1800 [==================>...........] - ETA: 0s - loss: 0.6175 - acc: 0.6727
1800/1800 [==============================] - 0s 52us/step - loss: 0.6098 - acc: 0.6861
Epoch 54/250
32/1800 [..............................] - ETA: 0s - loss: 0.5195 - acc: 0.8125
1376/1800 [=====================>........] - ETA: 0s - loss: 0.6224 - acc: 0.6672
1800/1800 [==============================] - 0s 43us/step - loss: 0.6091 - acc: 0.6850
Epoch 55/250
What could be the cause of these oscillations in learning in such a simple model? Thanks
EDIT:
I have followed some suggestions from the comments and have modified the model accordingly. It now looks more like this:
from tensorflow.keras.layers import Dense, Dropout

model = tf.keras.Sequential()
model.add(Dense(250, input_dim=13, activation=tf.nn.relu))
model.add(Dropout(0.4))
model.add(Dense(200, activation=tf.nn.relu))
model.add(Dropout(0.4))
model.add(Dense(100, activation=tf.nn.relu))
model.add(Dropout(0.3))
model.add(Dense(50, activation=tf.nn.relu))
model.add(Dense(1, activation=tf.nn.sigmoid))
model.compile(optimizer='adadelta',
              loss='binary_crossentropy',
              metrics=['accuracy'])
Those oscillations are most definitely connected to the size of your network; each batch coming through changes your neural network considerably, as it does not have enough neurons to represent all the relationships. It works fine for one batch, then updates the weights for another and changes previously learned connections, effectively "unlearning". That is also why the loss is jumpy, as the network tries to accommodate the task you have given it.
Sigmoid activation and its saturation may be causing you trouble as well (the gradient is squashed into a small region, so most gradient updates are near zero). Quick fix: use ReLU activation as described below.
Additionally, a neural network does not care about accuracy, only about minimizing the loss value (which it tries to do most of the time). Say it predicts the probabilities [0.55, 0.55, 0.55, 0.55, 0.45] for the classes [1, 1, 1, 1, 0]; its accuracy is 100%, but it is pretty uncertain. Now, say the next update pushes the network to the probability predictions [0.8, 0.8, 0.8, 0.8, 0.55]. In such a case, the loss would drop, but so would accuracy, from 100% to 80%.
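To make this concrete, here is a small illustrative sketch (plain NumPy, not your model's code) that computes the binary cross-entropy for both prediction sets; the loss improves even though the accuracy gets worse:

import numpy as np

# Binary cross-entropy, averaged over the samples.
def bce(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y = [1, 1, 1, 1, 0]
print(bce(y, [0.55, 0.55, 0.55, 0.55, 0.45]))  # ~0.598, accuracy 100%
print(bce(y, [0.80, 0.80, 0.80, 0.80, 0.55]))  # ~0.338, accuracy 80%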
BTW, you may want to check the scores for logistic regression and see how it performs on this task (so a single layer with only a sigmoid output).
It's always good to start with a simple model and grow it bigger if needed (I wouldn't advise going the other way around). You may want to check on a really small subsample of data (say two or three batches, 160 elements or so) whether your model can learn the relationship between input and output at all. In your case I doubt the model will be able to learn those relationships with the layer sizes you are providing. Try increasing the size, especially in the earlier layers (maybe 50/100 for starters), and see how it behaves; both checks are sketched below.
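A minimal sketch combining both suggestions, reusing the x_train/y_train arrays from the question (the epoch count and subsample size are arbitrary choices, not tuned values):

import tensorflow as tf
from tensorflow.keras.layers import Dense

# Logistic regression baseline: a single unit with a sigmoid output.
baseline = tf.keras.Sequential([Dense(1, input_dim=13, activation='sigmoid')])
baseline.compile(optimizer='adam',
                 loss='binary_crossentropy',
                 metrics=['accuracy'])

# Sanity check on a tiny subsample (a few batches): a model that cannot
# overfit ~160 examples will not learn the full dataset either.
baseline.fit(x_train[:160], y_train[:160], epochs=200, verbose=0)
print(baseline.evaluate(x_train[:160], y_train[:160]))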
Sigmoid easily saturates (there is only a small region where changes occur; most of its values are almost 0 or 1). It is rarely used nowadays, except as the activation of the final (output) layer. Most common nowadays is ReLU, which is not prone to saturation (at least when the input is positive), or one of its variations. This might help as well.
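To see the saturation numerically, here is a tiny illustrative sketch in plain NumPy:

import numpy as np

# The sigmoid derivative s'(x) = s(x) * (1 - s(x)) peaks at 0.25 for x = 0
# and collapses toward zero as |x| grows, which starves gradient updates.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(x)
    print(x, s * (1 - s))  # 0.25, ~0.105, ~0.0066, ~0.000045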
For each dataset and each neural network model, the optimal choice of learning rate is different. Defaults usually work so-so, but when the learning rate is too small the network might get stuck in a local minimum (and its generalization will be worse), while a value that is too big will make your network unstable (the loss will oscillate heavily).
You may want to read up on Cyclical Learning Rates (in the original research paper by Leslie N. Smith). There you can find info on how to choose a good learning rate heuristically and set up some simple learning rate schedulers. Those techniques were used by fast.ai teams in the CIFAR10 competitions with really good results. On their site or in the documentation of their library you can find the One Cycle Policy and the learning rate finder (based on the work of the aforementioned researcher). This should get you started in this realm, I think; a rough sketch of a scheduler follows.
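Here is a rough sketch of a triangular cyclical schedule wired into Keras; the base_lr, max_lr and step_size values are assumptions you would have to tune, not recommendations from the paper:

import numpy as np
import tensorflow as tf

def triangular_clr(epoch, lr=None):
    # Assumed values to tune for your data.
    base_lr, max_lr, step_size = 1e-4, 1e-2, 10
    # The rate climbs linearly from base_lr to max_lr over step_size
    # epochs, then descends back, and the cycle repeats.
    cycle = np.floor(1 + epoch / (2 * step_size))
    x = abs(epoch / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

clr = tf.keras.callbacks.LearningRateScheduler(triangular_clr)
# model.fit(x_train, y_train, epochs=250, callbacks=[clr])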
Not sure, but this normalization looks pretty non-standard to me (I have never seen it done like that). Good normalization is the basis for neural network convergence (unless the data is already pretty close to a normal distribution). Usually one subtracts the mean and divides by the standard deviation for each feature. You can check some schemes in the scikit-learn library, for example.
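For example, a per-feature standardization with scikit-learn could replace the normalize function above; the scaler is fitted on the training split only, so no test-set statistics leak into training:

from sklearn.preprocessing import StandardScaler

# Zero mean, unit variance per feature; fit on the training split only.
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)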
This shouldn't be an issue, but if your input is complicated you should consider adding more layers to your neural network (right now it is almost definitely too thin). This would allow it to learn more abstract features and transform the input space more.
When the network overfits the data you may employ some regularization techniques (it is hard to tell what might help; you should test it on your own). Some of those include:
- dropout, which randomly zeroes activations during training (already present in your edited model above);
- L1/L2 weight regularization, with the regularization strength set to something like 1e-2 or 1e-3; you would have to test those values experimentally;
- early stopping: after N epochs without improvement on the validation set, you end your training. This is a pretty common technique and should be used almost every time. Remember to save the best model on the validation set and set patience (the N mentioned above) to some moderately sized value (do not set patience to 1 epoch or so; a neural network may easily improve after 5). A minimal callback sketch follows this list.
Plus there are tons of other techniques you may find. Check what makes intuitive sense and which one you like the most, and test how it performs.
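For early stopping specifically, Keras ships a ready-made callback; a minimal sketch (the patience value and validation split are placeholders to tune):

import tensorflow as tf

# Stop after `patience` epochs without val_loss improvement and restore
# the weights from the best epoch seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                              patience=10,
                                              restore_best_weights=True)
# model.fit(x_train, y_train, validation_split=0.2,
#           epochs=250, callbacks=[early_stop])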