Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Backpropagation algorithm giving bad results

I'm trying to tackle the classic handwritten digit recognition problem with a feed forward neural network and backpropagation, using the MNIST dataset. I'm using Michael Nielsen's book to learn the essentials and 3Blue1Brown's youtube video for the backpropagation algorithm.

I finished writing it some time ago and been debugging since, because the results are quite bad. At its best the network can recognize ~4000/10000 samples after 1 epoch and that number only drops on the following epochs, which lead me to believe there's some issue with the backpropagation algorithm. I've been drowning in index hell trying to debug this for the last few days and can't figure out where the issue is, I'd appreciate any help in pointing it out.

A bit of background: 1) I'm not using any matrix multiplication and no external frameworks, but doing everything with for loops because that's how I learned it from the video. 2) Unlike the book, I'm storing both weights and biases in the same array. The biases for every layer are a column at the end of the weight matrix for that layer.

And finally for the code, this is the Backpropagate method of the NeuralNetwork class, which is called in UpdateMiniBatch, which itself is called in SGD:

/// <summary>
/// Returns the partial derivative of the cost function on one sample with respect to every weight in the network.
/// </summary>
public List<double[,]> Backpropagate(ITrainingSample sample)
{
    // Forwards pass
    var (weightedInputs, activations) = GetWeightedInputsAndActivations(sample.Input);

    // The derivative with respect to the activation of the last layer is simple to compute: activation - expectedActivation
    var errors = activations.Last().Select((a, i) => a - sample.Output[i]).ToArray();

    // Backwards pass
    List<double[,]> delCostOverDelWeights = Weights.Select(x => new double[x.GetLength(0), x.GetLength(1)]).ToList();
    List<double[]> delCostOverDelActivations = Weights.Select(x => new double[x.GetLength(0)]).ToList();
    delCostOverDelActivations[delCostOverDelActivations.Count - 1] = errors;

    // Comment notation:
    // Cost function: C
    // Weight connecting the i-th neuron on the (l + 1)-th layer to the j-th neuron on the l-th layer: w[l][i, j]
    // Bias of the i-th neuron on the (l + 1)-th layer: b[l][i]
    // Activation of the i-th neuon on the l-th layer: a[l][i]
    // Weighted input of the i-th neuron on the l-th layer: z[l][i] // which doesn't make sense on layer 0, but is left for index convenience
    // Notice that weights, biases, delCostOverDelWeights and delCostOverDelActivation all start at layer 1 (the 0-th layer is irrelevant to their meanings) while activations and weightedInputs strat at the 0-th layer

    for (int l = Weights.Count - 1; l >= 0; l--)
    {
        //Calculate ∂C/∂w for the current layer:
        for (int i = 0; i < Weights[l].GetLength(0); i++)
            for (int j = 0; j < Weights[l].GetLength(1); j++)
                delCostOverDelWeights[l][i, j] = // ∂C/∂w[l][i, j]
                    delCostOverDelActivations[l][i] * // ∂C/∂a[l + 1][i]
                    SigmoidPrime(weightedInputs[l + 1][i]) * // ∂a[l + 1][i]/∂z[l + 1][i] = ∂(σ(z[l + 1][i]))/∂z[l + 1][i] = σ′(z[l + 1][i])
                    (j < Weights[l].GetLength(1) - 1 ? activations[l][j] : 1); // ∂z[l + 1][i]/∂w[l][i, j] = a[l][j] ||OR|| ∂z[l + 1][i]/∂b[l][i] = 1

        // Calculate ∂C/∂a for the previous layer(a[l]):
        if (l != 0)
            for (int i = 0; i < Weights[l - 1].GetLength(0); i++)
                for (int j = 0; j < Weights[l].GetLength(0); j++)
                    delCostOverDelActivations[l - 1][i] += // ∂C/∂a[l][i] = sum over j:
                        delCostOverDelActivations[l][j] * // ∂C/∂a[l + 1][j]
                        SigmoidPrime(weightedInputs[l + 1][j]) * // ∂a[l + 1][j]/∂z[l + 1][j] = ∂(σ(z[l + 1][j]))/∂z[l + 1][j] = σ′(z[l + 1][j])
                        Weights[l][j, i]; // ∂z[l + 1][j]/∂a[l][i] = w[l][j, i]
    }

    return delCostOverDelWeights;
}

GetWeightedInputsAndActivations:

public (List<double[]>, List<double[]>) GetWeightedInputsAndActivations(double[] input)
{
    List<double[]> activations = new List<double[]>() { input }.Concat(Weights.Select(x => new double[x.GetLength(0)])).ToList();
    List<double[]> weightedInputs = activations.Select(x => new double[x.Length]).ToList();

    for (int l = 0; l < Weights.Count; l++)
        for (int i = 0; i < Weights[l].GetLength(0); i++)
        {
            double value = 0;
            for (int j = 0; j < Weights[l].GetLength(1) - 1; j++)
                value += Weights[l][i, j] * activations[l][j];// weights
            weightedInputs[l + 1][i] = value + Weights[l][i, Weights[l].GetLength(1) - 1];// bias
            activations[l + 1][i] = Sigmoid(weightedInputs[l + 1][i]);
        }

    return (weightedInputs, activations);
}

The entire NeuralNetwork as well as everything else can be found here.

EDIT: after many significant changes to the repo the above link might no longer be functional, but should hopefully be irrelevant considering the answer. For completeness' sake this is a functional link to the changed repository.

like image 555
H. Saleh Avatar asked May 29 '19 17:05

H. Saleh


People also ask

What is the drawback of the backpropagation algorithm?

The biggest disadvantages of backpropagation are: Backpropagation could be rather sensitive to noisy data and irregularity. The performance of backpropagation relies very heavily on the training data. Backpropagation needs a very large amount of time for training.

What are the main problems with the back propagation learning algorithm?

Because each expert is only utilized for a few instances of inputs, back-propagation is slow and unreliable. And when new circumstances arise, the Mixture of Experts cannot adapt its parsing quickly. If a circumstance requires a new kind of expertise, existing Mixtures of Experts cannot add that specialization.

What is wrong about backpropagation?

In short, you can't do back-propagation if you don't have an objective function. You can't have an objective function if you don't have a measure between a predicted value and a labeled (actual or training data) value. So to achieve “unsupervised learning”, you may have ditch the ability to calculate a gradient.

What are general limitations of back propagation rule?

One of the major disadvantages of the backpropagation learning rule is its ability to get stuck in local minima. The error is a function of all the weights in a multidimensional space.


1 Answers

Fixed. The issue was: I didn't divide the pixel inputs by 255. Everything else seems to work correctly, and I'm now getting +9000/10000 on the first epoch.

like image 126
H. Saleh Avatar answered Oct 02 '22 11:10

H. Saleh