I'm trying to tackle the classic handwritten digit recognition problem with a feedforward neural network and backpropagation, using the MNIST dataset. I'm using Michael Nielsen's book to learn the essentials and 3Blue1Brown's YouTube video for the backpropagation algorithm.
I finished writing it some time ago and have been debugging it since, because the results are quite bad. At its best the network recognizes ~4000/10000 test samples after 1 epoch, and that number only drops on the following epochs, which led me to believe there's some issue with the backpropagation algorithm. I've been drowning in index hell trying to debug this for the last few days and can't figure out where the issue is; I'd appreciate any help in pointing it out.
A bit of background: 1) I'm not using any matrix multiplication or external frameworks; everything is done with for loops, because that's how I learned it from the video. 2) Unlike the book, I'm storing the weights and biases in the same array: the biases for every layer are a column at the end of that layer's weight matrix (see the sketch below).
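To make that concrete: for a layer with nIn inputs and nOut neurons the weight matrix is nOut × (nIn + 1), and the last column plays the role of the bias vector. Roughly like this (a sketch just to illustrate the layout, the names are illustrative and not from the repo):

// One layer with nIn = 784 inputs and nOut = 30 neurons:
// weights[i, j] for j < nIn is the weight from input j to neuron i,
// weights[i, nIn] is the bias of neuron i.
int nIn = 784, nOut = 30;
double[,] weights = new double[nOut, nIn + 1];

double[] a = new double[nIn]; // activations of the previous layer
double[] z = new double[nOut]; // weighted inputs of this layer
for (int i = 0; i < nOut; i++)
{
    double value = weights[i, nIn]; // the bias sits in the last column
    for (int j = 0; j < nIn; j++)
        value += weights[i, j] * a[j];
    z[i] = value;
}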
And finally for the code, this is the Backpropagate method of the NeuralNetwork class, which is called in UpdateMiniBatch, which itself is called in SGD:
/// <summary>
/// Returns the partial derivative of the cost function on one sample with respect to every weight in the network.
/// </summary>
public List<double[,]> Backpropagate(ITrainingSample sample)
{
    // Forward pass
    var (weightedInputs, activations) = GetWeightedInputsAndActivations(sample.Input);
    // The derivative with respect to the activation of the last layer is simple to compute: activation - expectedActivation
    var errors = activations.Last().Select((a, i) => a - sample.Output[i]).ToArray();
    // Backward pass
    List<double[,]> delCostOverDelWeights = Weights.Select(x => new double[x.GetLength(0), x.GetLength(1)]).ToList();
    List<double[]> delCostOverDelActivations = Weights.Select(x => new double[x.GetLength(0)]).ToList();
    delCostOverDelActivations[delCostOverDelActivations.Count - 1] = errors;
    // Comment notation:
    // Cost function: C
    // Weight connecting the i-th neuron on the (l + 1)-th layer to the j-th neuron on the l-th layer: w[l][i, j]
    // Bias of the i-th neuron on the (l + 1)-th layer: b[l][i]
    // Activation of the i-th neuron on the l-th layer: a[l][i]
    // Weighted input of the i-th neuron on the l-th layer: z[l][i] // which doesn't make sense on layer 0, but is left for index convenience
    // Notice that weights, biases, delCostOverDelWeights and delCostOverDelActivations all start at layer 1 (the 0-th layer is irrelevant to their meanings) while activations and weightedInputs start at the 0-th layer
    for (int l = Weights.Count - 1; l >= 0; l--)
    {
        // Calculate ∂C/∂w for the current layer:
        for (int i = 0; i < Weights[l].GetLength(0); i++)
            for (int j = 0; j < Weights[l].GetLength(1); j++)
                delCostOverDelWeights[l][i, j] =                               // ∂C/∂w[l][i, j]
                    delCostOverDelActivations[l][i] *                          // ∂C/∂a[l + 1][i]
                    SigmoidPrime(weightedInputs[l + 1][i]) *                   // ∂a[l + 1][i]/∂z[l + 1][i] = ∂(σ(z[l + 1][i]))/∂z[l + 1][i] = σ′(z[l + 1][i])
                    (j < Weights[l].GetLength(1) - 1 ? activations[l][j] : 1); // ∂z[l + 1][i]/∂w[l][i, j] = a[l][j] ||OR|| ∂z[l + 1][i]/∂b[l][i] = 1
        // Calculate ∂C/∂a for the previous layer (a[l]):
        if (l != 0)
            for (int i = 0; i < Weights[l - 1].GetLength(0); i++)
                for (int j = 0; j < Weights[l].GetLength(0); j++)
                    delCostOverDelActivations[l - 1][i] +=                     // ∂C/∂a[l][i] = sum over j:
                        delCostOverDelActivations[l][j] *                      // ∂C/∂a[l + 1][j]
                        SigmoidPrime(weightedInputs[l + 1][j]) *                // ∂a[l + 1][j]/∂z[l + 1][j] = ∂(σ(z[l + 1][j]))/∂z[l + 1][j] = σ′(z[l + 1][j])
                        Weights[l][j, i];                                       // ∂z[l + 1][j]/∂a[l][i] = w[l][j, i]
    }
    return delCostOverDelWeights;
}
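For context, UpdateMiniBatch isn't shown here; it essentially averages these per-sample gradients over the mini-batch and takes a gradient-descent step on the combined weight/bias matrices. Roughly like this (a simplified sketch of the idea, not the exact code from the repo):

public void UpdateMiniBatch(IEnumerable<ITrainingSample> miniBatch, double learningRate)
{
    var samples = miniBatch.ToList();
    // Accumulate ∂C/∂w over the mini-batch
    List<double[,]> totalGradient = Weights.Select(w => new double[w.GetLength(0), w.GetLength(1)]).ToList();
    foreach (var sample in samples)
    {
        var gradient = Backpropagate(sample);
        for (int l = 0; l < Weights.Count; l++)
            for (int i = 0; i < Weights[l].GetLength(0); i++)
                for (int j = 0; j < Weights[l].GetLength(1); j++)
                    totalGradient[l][i, j] += gradient[l][i, j];
    }
    // Gradient-descent step: w -= (eta / batchSize) * ∂C/∂w
    // (the biases live in the last column, so they get updated here too)
    for (int l = 0; l < Weights.Count; l++)
        for (int i = 0; i < Weights[l].GetLength(0); i++)
            for (int j = 0; j < Weights[l].GetLength(1); j++)
                Weights[l][i, j] -= learningRate / samples.Count * totalGradient[l][i, j];
}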
GetWeightedInputsAndActivations:
public (List<double[]>, List<double[]>) GetWeightedInputsAndActivations(double[] input)
{
    List<double[]> activations = new List<double[]>() { input }.Concat(Weights.Select(x => new double[x.GetLength(0)])).ToList();
    List<double[]> weightedInputs = activations.Select(x => new double[x.Length]).ToList();
    for (int l = 0; l < Weights.Count; l++)
        for (int i = 0; i < Weights[l].GetLength(0); i++)
        {
            double value = 0;
            for (int j = 0; j < Weights[l].GetLength(1) - 1; j++)
                value += Weights[l][i, j] * activations[l][j]; // weights
            weightedInputs[l + 1][i] = value + Weights[l][i, Weights[l].GetLength(1) - 1]; // bias
            activations[l + 1][i] = Sigmoid(weightedInputs[l + 1][i]);
        }
    return (weightedInputs, activations);
}
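For completeness, Sigmoid and SigmoidPrime are the standard logistic function and its derivative, something like this (a sketch; the exact implementation lives in the repo):

private static double Sigmoid(double z) => 1.0 / (1.0 + Math.Exp(-z));

// σ′(z) = σ(z) * (1 - σ(z))
private static double SigmoidPrime(double z)
{
    double s = Sigmoid(z);
    return s * (1 - s);
}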
The entire NeuralNetwork as well as everything else can be found here.
EDIT: After many significant changes to the repo, the above link may no longer work, but it should be irrelevant given the answer. For completeness' sake, this is a working link to the changed repository.
Fixed. The issue was that I didn't divide the pixel inputs by 255. Everything else seems to work correctly, and I'm now getting 9000+/10000 on the first epoch.
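For anyone hitting the same wall: the raw MNIST pixel values are bytes in 0–255, which immediately saturates the sigmoids; scaling them into [0, 1] when building each training sample is all it takes. A minimal sketch (the helper name is just illustrative):

// rawPixels: the 784 bytes of one MNIST image, values 0–255
double[] NormalizeInput(byte[] rawPixels)
{
    // Scale each pixel into [0, 1] so the weighted inputs don't saturate the sigmoid
    return rawPixels.Select(p => p / 255.0).ToArray();
}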