Can somebody please explain the backpropagation algorithm to me?

I've recently completed Professor Ng's Machine Learning course on Coursera, and while I loved the entire course, I never really managed to understand the backpropagation algorithm for training neural networks.

My problem with understanding it is, he only ever teaches the vectorised implementation of it for fully-connected feed-forward networks. My linear algebra is rusty, and I think it would be much easier to understand if somebody could teach me the general purpose algorithm. Maybe in a node-oriented fashion.

I'll try and phrase the problem simply, but I may be misunderstanding how backprop works, so if this doesn't make sense, disregard it:

For any given node N, given the input weights/values, the output weights/values, and the error/cost of all the nodes that N outputs to, how do I calculate the "cost" of N and use this to update the input weights?

asked Oct 27 '12 by Jonathon Ashworth

1 Answer

Let's consider a node in a back-propagation (BP) network. It has multiple inputs, and produces an output value. We want to use error-correction for training, so it will also update weights based on an error estimate for the node.

Each node has a bias value, θ. You can think of this as a weight on an internal input that is held constant at 1.0.
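
As a quick illustration (a minimal Python sketch with made-up values, not something from the answer itself), folding the bias into the weight list looks like this:

    # The inputs and weights here are made-up example values.
    inputs  = [0.5, 0.8]           # outputs o_i from the preceding layer
    weights = [0.2, -0.4]          # weights w_ij into node j
    theta   = 0.1                  # bias value for node j

    # Equivalent view: one extra input fixed at 1.0, whose weight is the bias.
    inputs_with_bias  = inputs + [1.0]
    weights_with_bias = weights + [theta]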

The activation is the sum of the weighted inputs and the bias value. Let's refer to our node of interest as j, index the nodes in the preceding layer with i, and index the nodes in the succeeding layer with k. The activation of our node j is then:

net_j = ∑_i (o_i × w_ij) + θ_j

That is, the activation value for node j is the sum, over the preceding nodes i, of each output o_i times the corresponding weight w_ij linking node i to node j, plus the bias value.
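
In code, that might look like the following minimal Python sketch (the names and values are placeholders I've chosen, not part of the answer):

    # net_j = sum over i of (o_i * w_ij), plus the bias theta_j
    def activation(outputs_i, weights_ij, theta_j):
        return sum(o * w for o, w in zip(outputs_i, weights_ij)) + theta_j

    net_j = activation([0.5, 0.8], [0.2, -0.4], 0.1)   # placeholder values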

The output of our node j is a transfer function of the activation:

o_j = f(net_j)

f is commonly the sigmoid function.

f(net_j) = 1 / (1 + e^(-net_j))

The sigmoid function has an easy-to-specify first derivative:

f'(net_j) = f(net_j) × (1 - f(net_j))
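
Here is a sketch of the sigmoid and its derivative in Python; only the standard-library math module is assumed:

    import math

    def f(net):
        # sigmoid: f(net) = 1 / (1 + e^(-net))
        return 1.0 / (1.0 + math.exp(-net))

    def f_prime(net):
        # sigmoid derivative: f'(net) = f(net) * (1 - f(net))
        fx = f(net)
        return fx * (1.0 - fx)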

Whatever transfer function we use, we need to know how to compute its first derivative: BP works by gradient descent via the Chain Rule, so the derivative is essential. With a different transfer function, the derivative equation will be different.
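
For instance, if you swapped in tanh as the transfer function (a common alternative, mentioned here only as an illustration), the derivative takes a different form:

    import math

    def tanh_prime(net):
        # derivative of tanh: 1 - tanh(net)^2 -- a different form than the sigmoid's
        return 1.0 - math.tanh(net) ** 2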

So far, we know how to get input values, compute the activation, compute the output, and compute the first derivative of the transfer function at the activation. Now we need to deal with errors and weight adjustment.

The value used for a node's error estimate in BP is called δ. The δ for a node is the product of the first derivative of the transfer function, evaluated at the node's activation, and an error term the node receives. There are two formulations for the received error term, one for output nodes and one for hidden nodes.

Generically,

δ = f'(net) × (received error)

For an output node,

δ_output = f'(net) × (t - o)

where t is the expected value at that output node, and o is the actual output value of that output node.
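
Continuing the Python snippets above (f_prime is the derivative defined earlier; the parameter names are mine), a sketch of the output-node δ:

    def delta_output(net, t, o):
        # delta_output = f'(net) * (t - o), with t the target and o the actual output
        return f_prime(net) * (t - o)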

For our hidden node j, it is like this:

δ_j = f'(net_j) × ∑_k (δ_k × w_jk)

The δ for our node j, δ_j, is the product of the first derivative of our transfer function, evaluated at the activation, and the sum of the deltas in the next layer (closer to the output), each multiplied by the value of the connecting weight (see the sketch just below). With that in hand, we can calculate how to adjust the weights going to the preceding layer of nodes (closer to the input).
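
A sketch of the hidden-node δ computation, again continuing the earlier snippets:

    def delta_hidden(net_j, deltas_k, weights_jk):
        # delta_j = f'(net_j) * sum over k of (delta_k * w_jk)
        received_error = sum(d_k * w for d_k, w in zip(deltas_k, weights_jk))
        return f_prime(net_j) * received_error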

dw_ij = L × o_i × δ_j

dw here represents "change in weight", so the equation says that the change in the weight from node i to our node j is the product of the learning parameter L (typically the same value for all nodes in the network), the output value of node i, and the δ (error term) for our node j.
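
In code, assuming the same naming as before:

    def weight_change(L, o_i, delta_j):
        # dw_ij = L * o_i * delta_j
        return L * o_i * delta_j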

Adjusting the bias value is similar to adjusting a weight.

dθ_j = L × f(θ_j) × δ_j

dθ here represents "change in θ". We apply the transfer function to the bias value θ_j to get a term like the output of a node; otherwise, it looks just like the weight-change equation.
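
A matching sketch, reusing f from above:

    def bias_change(L, theta_j, delta_j):
        # d_theta_j = L * f(theta_j) * delta_j, as the answer gives it
        return L * f(theta_j) * delta_j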

I should note that the weight changes should be calculated network-wide first, and the changes applied only after all of them have been computed.
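
A minimal sketch of that two-pass ordering, assuming the weights are stored in a dict keyed by (i, j) pairs (a layout I made up for illustration):

    # Tiny made-up example: one node (index 2) fed by nodes 0 and 1.
    L = 0.5                                  # learning parameter
    outputs = {0: 0.5, 1: 0.8}               # o_i per node index
    deltas  = {2: 0.12}                      # delta_j per node index
    weights = {(0, 2): 0.2, (1, 2): -0.4}    # w_ij keyed by (i, j)

    # Pass 1: compute every change using the current (unmodified) weights.
    changes = {(i, j): L * outputs[i] * deltas[j] for (i, j) in weights}

    # Pass 2: apply the changes only after all of them are known.
    for key, dw in changes.items():
        weights[key] += dw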

answered Oct 17 '22 by Wesley R. Elsberry