I am currently working on a small binary classification project using the new Keras API in TensorFlow. The problem is a simplified version of the Higgs Boson challenge posted on Kaggle.com a few years back. The dataset shape is 2000x14, where the first 13 elements of each row form the input vector, and the 14th element is the corresponding label. Here is a sample of said dataset:
86.043,52.881,61.231,95.475,0.273,77.169,-0.015,1.856,32.636,202.068,2.432,-0.419,0.0,0
138.149,69.197,58.607,129.848,0.941,120.276,3.811,1.886,71.435,384.916,2.447,1.408,0.0,1
137.457,3.018,74.670,81.705,5.954,775.772,-8.854,2.625,1.942,157.231,1.193,0.873,0.824,1
I am relatively new to machine learning and TensorFlow, but I am familiar with the higher-level concepts such as loss functions, optimizers and activation functions. I have tried building various models inspired by examples of binary classification problems found online, but I am having difficulties with training the model. During training, the loss sometimes increases within the same epoch, leading to unstable learning. The accuracy hits a plateau around 70%. I have tried changing the learning rate and other hyperparameters, but to no avail. In comparison, I have hardcoded a fully-connected feed-forward neural net that reaches around 80-85% accuracy on the same problem.
Here is my current model:
import tensorflow as tf
from tensorflow.keras.layers import Dense
import numpy as np
import pandas as pd

def normalize(array):
    # Scale each row to unit L2 norm.
    return array / np.linalg.norm(array, ord=2, axis=1, keepdims=True)

# Read the dataset once, then split it 1800/200 into train and test sets.
data = pd.read_csv('data/labeled.csv', sep=r'\s+')
x_train = data.iloc[:1800, :-1].values
y_train = data.iloc[:1800, -1:].values
x_test = data.iloc[1800:, :-1].values
y_test = data.iloc[1800:, -1:].values

x_train = normalize(x_train)
x_test = normalize(x_test)

model = tf.keras.Sequential()
model.add(Dense(9, input_dim=13, activation=tf.nn.sigmoid))
model.add(Dense(6, activation=tf.nn.sigmoid))
model.add(Dense(1, activation=tf.nn.sigmoid))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=50)
model.evaluate(x_test, y_test)
As mentioned, some of the epochs start with a higher accuracy than they finish with, leading to unstable learning.
32/1800 [..............................] - ETA: 0s - loss: 0.6830 - acc: 0.5938
1152/1800 [==================>...........] - ETA: 0s - loss: 0.6175 - acc: 0.6727
1800/1800 [==============================] - 0s 52us/step - loss: 0.6098 - acc: 0.6861
Epoch 54/250
32/1800 [..............................] - ETA: 0s - loss: 0.5195 - acc: 0.8125
1376/1800 [=====================>........] - ETA: 0s - loss: 0.6224 - acc: 0.6672
1800/1800 [==============================] - 0s 43us/step - loss: 0.6091 - acc: 0.6850
Epoch 55/250
What could be the cause of these oscillations in learning in such a simple model? Thanks
EDIT:
I have followed some suggestions from the comments and have modified the model accordingly. It now looks more like this:
from tensorflow.keras.layers import Dense, Dropout

model = tf.keras.Sequential()
model.add(Dense(250, input_dim=13, activation=tf.nn.relu))
model.add(Dropout(0.4))
model.add(Dense(200, activation=tf.nn.relu))
model.add(Dropout(0.4))
model.add(Dense(100, activation=tf.nn.relu))
model.add(Dropout(0.3))
model.add(Dense(50, activation=tf.nn.relu))
model.add(Dense(1, activation=tf.nn.sigmoid))
model.compile(optimizer='adadelta',
              loss='binary_crossentropy',
              metrics=['accuracy'])
Those oscillations are most definitely connected to the size of your network; each batch coming through changes your neural network considerably, as it does not have enough neurons to represent all the relationships. It works fine for one batch, then updates the weights for another and changes previously learned connections, effectively "unlearning". That is also why the loss is jumpy, as the network tries to accommodate the task you have given it.
Sigmoid activation and its saturation may be causing you trouble as well (the gradient is squashed into a small region, so most gradient updates are near zero). Quick fix: use ReLU activation as described below.
Additionally, a neural network does not care about accuracy, only about minimizing the loss value (which it tries to do most of the time). Say it predicts the probabilities [0.55, 0.55, 0.55, 0.55, 0.45] for the classes [1, 1, 1, 1, 0]; its accuracy is 100%, but it is pretty uncertain. Now, say the next update pushes the network to the probability predictions [0.8, 0.8, 0.8, 0.8, 0.55]. In such a case, the loss would drop, but so would accuracy, from 100% to 80%.
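To make this concrete, here is a small illustrative sketch (plain NumPy, not your model's code) that computes the binary cross-entropy for both prediction sets; the loss improves even though the accuracy gets worse:

import numpy as np

# Binary cross-entropy, averaged over the samples.
def bce(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y = [1, 1, 1, 1, 0]
print(bce(y, [0.55, 0.55, 0.55, 0.55, 0.45]))  # ~0.598, accuracy 100%
print(bce(y, [0.80, 0.80, 0.80, 0.80, 0.55]))  # ~0.338, accuracy 80%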
BTW, you may want to check the scores for logistic regression and see how it performs on this task (so a single layer with only a sigmoid output).
It's always good to start with a simple model and grow it bigger if needed (I wouldn't advise going the other way around). You may want to check on a really small subsample of data (say two or three batches, 160 elements or so) whether your model can learn the relationship between input and output at all. In your case I doubt the model will be able to learn those relationships with the layer sizes you are providing. Try increasing the size, especially in the earlier layers (maybe 50/100 for starters), and see how it behaves; both checks are sketched below.
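A minimal sketch combining both suggestions, reusing the x_train/y_train arrays from the question (the epoch count and subsample size are arbitrary choices, not tuned values):

import tensorflow as tf
from tensorflow.keras.layers import Dense

# Logistic regression baseline: a single unit with a sigmoid output.
baseline = tf.keras.Sequential([Dense(1, input_dim=13, activation='sigmoid')])
baseline.compile(optimizer='adam',
                 loss='binary_crossentropy',
                 metrics=['accuracy'])

# Sanity check on a tiny subsample (a few batches): a model that cannot
# overfit ~160 examples will not learn the full dataset either.
baseline.fit(x_train[:160], y_train[:160], epochs=200, verbose=0)
print(baseline.evaluate(x_train[:160], y_train[:160]))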
Sigmoid easily saturates (there is only a small region where changes occur; most of its values are almost 0 or 1). It is rarely used nowadays, except as the activation of the final (output) layer. Most common nowadays is ReLU, which is not prone to saturation (at least when the input is positive), or one of its variations. This might help as well.
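To see the saturation numerically, here is a tiny illustrative sketch in plain NumPy:

import numpy as np

# The sigmoid derivative s'(x) = s(x) * (1 - s(x)) peaks at 0.25 for x = 0
# and collapses toward zero as |x| grows, which starves gradient updates.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(x)
    print(x, s * (1 - s))  # 0.25, ~0.105, ~0.0066, ~0.000045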
For each dataset and each neural network model, the optimal choice of learning rate is different. Defaults usually work so-so, but when the learning rate is too small the network might get stuck in a local minimum (and its generalization will be worse), while a value that is too big will make your network unstable (the loss will oscillate heavily).
You may want to read up on Cyclical Learning Rates (in the original research paper by Leslie N. Smith). There you can find info on how to choose a good learning rate heuristically and set up some simple learning rate schedulers. Those techniques were used by fast.ai teams in the CIFAR10 competitions with really good results. On their site or in the documentation of their library you can find the One Cycle Policy and the learning rate finder (based on the work of the aforementioned researcher). This should get you started in this realm, I think; a rough sketch of a scheduler follows.
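Here is a rough sketch of a triangular cyclical schedule wired into Keras; the base_lr, max_lr and step_size values are assumptions you would have to tune, not recommendations from the paper:

import numpy as np
import tensorflow as tf

def triangular_clr(epoch, lr=None):
    # Assumed values to tune for your data.
    base_lr, max_lr, step_size = 1e-4, 1e-2, 10
    # The rate climbs linearly from base_lr to max_lr over step_size
    # epochs, then descends back, and the cycle repeats.
    cycle = np.floor(1 + epoch / (2 * step_size))
    x = abs(epoch / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

clr = tf.keras.callbacks.LearningRateScheduler(triangular_clr)
# model.fit(x_train, y_train, epochs=250, callbacks=[clr])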
Not sure, but this normalization looks pretty non-standard to me (I have never seen it done like that). Good normalization is the basis for neural network convergence (unless the data is already pretty close to a normal distribution). Usually one subtracts the mean and divides by the standard deviation for each feature. You can check some schemes in the scikit-learn library, for example.
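For example, a per-feature standardization with scikit-learn could replace the normalize function above; the scaler is fitted on the training split only, so no test-set statistics leak into training:

from sklearn.preprocessing import StandardScaler

# Zero mean, unit variance per feature; fit on the training split only.
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)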
This shouldn't be an issue, but if your input is complicated you should consider adding more layers to your neural network (right now it is almost definitely too thin). This would allow it to learn more abstract features and transform the input space more.
When the network overfits the data you may employ some regularization techniques (it is hard to tell what might help; you should test it on your own). Some of those include:
- dropout, which randomly zeroes activations during training (already present in your edited model above);
- L1/L2 weight regularization, with the regularization strength set to something like 1e-2 or 1e-3; you would have to test those values experimentally;
- early stopping: after N epochs without improvement on the validation set, you end your training. This is a pretty common technique and should be used almost every time. Remember to save the best model on the validation set and set patience (the N mentioned above) to some moderately sized value (do not set patience to 1 epoch or so; a neural network may easily improve after 5). A minimal callback sketch follows this list.
Plus there are tons of other techniques you may find. Check what makes intuitive sense and which one you like the most, and test how it performs.
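For early stopping specifically, Keras ships a ready-made callback; a minimal sketch (the patience value and validation split are placeholders to tune):

import tensorflow as tf

# Stop after `patience` epochs without val_loss improvement and restore
# the weights from the best epoch seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                              patience=10,
                                              restore_best_weights=True)
# model.fit(x_train, y_train, validation_split=0.2,
#           epochs=250, callbacks=[early_stop])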