My question is about forward and backward propagation for deep neural networks when the number of hidden layers is greater than 1.
I know what I have to do if I have a single hidden layer. In the case of a single hidden layer, if my input data X_train has n samples and d features (i.e. X_train is an (n, d) matrix and y_train is an (n, 1) vector), and if I have h1 hidden units in my first hidden layer, then I compute Z_h1 = (X_train * w_h1) + b_h1, where w_h1 is a weight matrix with random entries of shape (d, h1) and b_h1 is a bias vector of shape (h1, 1). I use the sigmoid activation A_h1 = sigmoid(Z_h1), and both A_h1 and Z_h1 have shape (n, h1). If I have t output units, then I use a weight matrix w_out of shape (h1, t) and a bias b_out of shape (t, 1) to get the output Z_out = (A_h1 * w_out) + b_out, from which I get A_out = sigmoid(Z_out) with shape (n, t). If I have a 2nd hidden layer (with h2 units) between the 1st hidden layer and the output layer, what steps must I add to the forward propagation, and which steps should I modify?
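For concreteness, a minimal NumPy sketch of the single-hidden-layer forward pass described above (the sizes and the small random initialization are illustrative assumptions, and the biases are stored here as (1, h) row vectors so they broadcast over the n samples):

# Minimal sketch of the single-hidden-layer forward pass (illustrative sizes).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, d, h1, t = 6, 4, 5, 3                      # sample, feature, hidden, output sizes (assumed)
rng = np.random.default_rng(0)

X_train = rng.normal(size=(n, d))             # (n, d) input matrix
w_h1 = rng.normal(size=(d, h1)) * 0.1         # (d, h1) weights, random init
b_h1 = np.zeros((1, h1))                      # (1, h1) bias, stored as a row vector
w_out = rng.normal(size=(h1, t)) * 0.1        # (h1, t) output weights
b_out = np.zeros((1, t))                      # (1, t) output bias

Z_h1 = X_train @ w_h1 + b_h1                  # (n, h1)
A_h1 = sigmoid(Z_h1)                          # (n, h1)
Z_out = A_h1 @ w_out + b_out                  # (n, t)
A_out = sigmoid(Z_out)                        # (n, t)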
I also have an idea of how to tackle backpropagation for single-hidden-layer networks. For the single-hidden-layer example in the previous paragraph, I know that in the first backpropagation step (output layer -> hidden layer 1) I should do Step1_BP1: Err_out = A_out - y_train_onehot, where y_train_onehot is the one-hot representation of y_train and Err_out has shape (n, t). This is followed by Step2_BP1: delta_w_out = (A_h1)^T * Err_out and delta_b_out = sum(Err_out), where (.)^T denotes the matrix transpose. For the second backpropagation step (hidden layer 1 -> input layer), we do Step1_BP2: sig_deriv_h1 = A_h1 * (1 - A_h1), where sig_deriv_h1 has shape (n, h1). In the next step, I do Step2_BP2: Err_h1 = (Err_out * (w_out)^T) ⊙ sig_deriv_h1, where ⊙ denotes element-wise multiplication, so Err_h1 has shape (n, h1). In the final step, I do Step3_BP2: delta_w_h1 = (X_train)^T * Err_h1 and delta_b_h1 = sum(Err_h1). What backpropagation steps should I add if I have a 2nd hidden layer (h2 units) between the 1st hidden layer and the output layer? Should I modify the backpropagation steps for the single-hidden-layer case I have described here?
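A matching sketch of those single-hidden-layer backpropagation steps (self-contained, so it repeats the forward pass; the one-hot targets, the learning rate, and the column-wise sums for the bias gradients are illustrative assumptions):

# Minimal sketch of the single-hidden-layer backward pass (illustrative setup).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, d, h1, t, eta = 6, 4, 5, 3, 0.01
rng = np.random.default_rng(0)
X_train = rng.normal(size=(n, d))
y_train_onehot = np.eye(t)[rng.integers(0, t, size=n)]   # (n, t) one-hot targets (assumed)
w_h1, b_h1 = rng.normal(size=(d, h1)) * 0.1, np.zeros((1, h1))
w_out, b_out = rng.normal(size=(h1, t)) * 0.1, np.zeros((1, t))

# forward pass, as in the previous paragraph
A_h1 = sigmoid(X_train @ w_h1 + b_h1)                    # (n, h1)
A_out = sigmoid(A_h1 @ w_out + b_out)                    # (n, t)

# Step1_BP1 / Step2_BP1: output layer
Err_out = A_out - y_train_onehot                         # (n, t)
delta_w_out = A_h1.T @ Err_out                           # (h1, t)
delta_b_out = Err_out.sum(axis=0, keepdims=True)         # (1, t)

# Step1_BP2 .. Step3_BP2: hidden layer 1
sig_deriv_h1 = A_h1 * (1.0 - A_h1)                       # (n, h1)
Err_h1 = (Err_out @ w_out.T) * sig_deriv_h1              # (n, h1), element-wise product
delta_w_h1 = X_train.T @ Err_h1                          # (d, h1)
delta_b_h1 = Err_h1.sum(axis=0, keepdims=True)           # (1, h1)

# gradient-descent updates with learning rate eta
w_out -= eta * delta_w_out;  b_out -= eta * delta_b_out
w_h1  -= eta * delta_w_h1;   b_h1  -= eta * delta_b_h1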
● Let X be the matrix of samples, of shape (n, d), where n denotes the number of samples and d denotes the number of features.
● Let w_h1 be the matrix of weights, of shape (d, h1), and
● let b_h1 be the bias vector, of shape (1, h1).
You need the following steps for forward and backward propagations:
► FORWARD PROPAGATION:
⛶ Step 1:
Z_h1 = [ X • w_h1 ] + b_h1
shapes:  (n,h1) = (n,d) • (d,h1) + (1,h1)
Here, the symbol • represents matrix multiplication, and h1 denotes the number of hidden units in the first hidden layer.
⛶ Step 2:
Let Φ() be the activation function. We get:
a_h1 = Φ(Z_h1)
shapes:  (n,h1) = Φ( (n,h1) )
⛶ Step 3:
Obtain new weights and biases:
● w_h2 of shape (h1, h2), and
● b_h2 of shape (1, h2).
⛶ Step 4:
Z_h2 = [ a_h1 • w_h2 ] + b_h2
shapes:  (n,h2) = (n,h1) • (h1,h2) + (1,h2)
Here, h2 is the number of hidden units in the second hidden layer.
⛶ Step 5:
a_h2 = Φ(Z_h2)
shapes:  (n,h2) = Φ( (n,h2) )
⛶ Step 6:
Obtain new weights and biases:
● w_out of shape (h2, t), and
● b_out of shape (1, t).
Here, t is the number of classes.
⛶ Step 7:
Z_out = [ a_h2 • w_out ] + b_out
shapes:  (n,t) = (n,h2) • (h2,t) + (1,t)
⛶ Step 8:
a_out = Φ(Z_out)
shapes:  (n,t) = Φ( (n,t) )
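Putting Steps 1-8 together, a minimal NumPy sketch (the sizes, the choice of sigmoid for Φ, and the small random initialization are assumptions for illustration):

# Minimal sketch of the two-hidden-layer forward pass (illustrative sizes).
import numpy as np

def phi(z):                                    # sigmoid activation Φ
    return 1.0 / (1.0 + np.exp(-z))

n, d, h1, h2, t = 6, 4, 5, 4, 3                # sizes (assumed)
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d))                    # (n, d)
w_h1, b_h1 = rng.normal(size=(d, h1)) * 0.1, np.zeros((1, h1))
w_h2, b_h2 = rng.normal(size=(h1, h2)) * 0.1, np.zeros((1, h2))
w_out, b_out = rng.normal(size=(h2, t)) * 0.1, np.zeros((1, t))

Z_h1 = X @ w_h1 + b_h1                         # Step 1: (n, h1)
a_h1 = phi(Z_h1)                               # Step 2: (n, h1)
Z_h2 = a_h1 @ w_h2 + b_h2                      # Step 4: (n, h2)
a_h2 = phi(Z_h2)                               # Step 5: (n, h2)
Z_out = a_h2 @ w_out + b_out                   # Step 7: (n, t)
a_out = phi(Z_out)                             # Step 8: (n, t)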
► BACKWARD PROPAGATION:
⛶ Step 1:
Construct the one-hot encoded matrix of the output classes ( y_onehot ).
Error_out = a_out - y_onehot
shapes:  (n,t) = (n,t) - (n,t)
⛶ Step 2:
Δw_out = η ( (a_h2)^T • Error_out )
shapes:  (h2,t) = (h2,n) • (n,t)
Δb_out = η [ ∑_{i=1}^{n} (Error_out)_i ]    (sum over the n rows of Error_out)
shapes:  (1,t)
Here η is the learning rate.
w_out = w_out - Δw_out    (weight update)
b_out = b_out - Δb_out    (bias update)
⛶ Step 3:
Error2 = [ Error_out • (w_out)^T ] ✴ Φ'(a_h2)
shapes:  (n,h2) = [ (n,t) • (t,h2) ] ✴ (n,h2)
Here, the symbol ✴ denotes element-wise matrix multiplication, and Φ' denotes the derivative of the sigmoid function, so Φ'(a_h2) = a_h2 ✴ (1 - a_h2).
⛶ Step 4:
Δw_h2 = η ( (a_h1)^T • Error2 )
shapes:  (h1,h2) = (h1,n) • (n,h2)
Δb_h2 = η [ ∑_{i=1}^{n} (Error2)_i ]    (sum over the n rows of Error2)
shapes:  (1,h2)
w_h2 = w_h2 - Δw_h2    (weight update)
b_h2 = b_h2 - Δb_h2    (bias update)
⛶ Step 5:
Error3 = [ Error2 • (w_h2)^T ] ✴ Φ'(a_h1)
shapes:  (n,h1) = [ (n,h2) • (h2,h1) ] ✴ (n,h1)
⛶ Step 6:
Δw_h1 = η ( X^T • Error3 )
shapes:  (d,h1) = (d,n) • (n,h1)
Δb_h1 = η [ ∑_{i=1}^{n} (Error3)_i ]    (sum over the n rows of Error3)
shapes:  (1,h1)
w_h1 = w_h1 - Δw_h1    (weight update)
b_h1 = b_h1 - Δb_h1    (bias update)
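And the backward Steps 1-6 in one sketch (self-contained, so it repeats the forward pass; the sizes, learning rate η, and one-hot targets are illustrative assumptions; the updates are applied after all error terms are computed, so each error term uses the pre-update weights):

# Minimal sketch of the two-hidden-layer backward pass (illustrative setup).
import numpy as np

def phi(z):                                              # sigmoid activation Φ
    return 1.0 / (1.0 + np.exp(-z))

n, d, h1, h2, t, eta = 6, 4, 5, 4, 3, 0.01               # sizes and learning rate (assumed)
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
y_onehot = np.eye(t)[rng.integers(0, t, size=n)]         # (n, t) one-hot targets (assumed)
w_h1, b_h1 = rng.normal(size=(d, h1)) * 0.1, np.zeros((1, h1))
w_h2, b_h2 = rng.normal(size=(h1, h2)) * 0.1, np.zeros((1, h2))
w_out, b_out = rng.normal(size=(h2, t)) * 0.1, np.zeros((1, t))

# forward pass (Steps 1-8 above)
a_h1 = phi(X @ w_h1 + b_h1)                              # (n, h1)
a_h2 = phi(a_h1 @ w_h2 + b_h2)                           # (n, h2)
a_out = phi(a_h2 @ w_out + b_out)                        # (n, t)

# Steps 1-2: output layer
Error_out = a_out - y_onehot                             # (n, t)
dw_out = eta * (a_h2.T @ Error_out)                      # (h2, t)
db_out = eta * Error_out.sum(axis=0, keepdims=True)      # (1, t)

# Steps 3-4: second hidden layer, Φ'(a) = a * (1 - a) for the sigmoid
Error2 = (Error_out @ w_out.T) * (a_h2 * (1 - a_h2))     # (n, h2)
dw_h2 = eta * (a_h1.T @ Error2)                          # (h1, h2)
db_h2 = eta * Error2.sum(axis=0, keepdims=True)          # (1, h2)

# Steps 5-6: first hidden layer
Error3 = (Error2 @ w_h2.T) * (a_h1 * (1 - a_h1))         # (n, h1)
dw_h1 = eta * (X.T @ Error3)                             # (d, h1)
db_h1 = eta * Error3.sum(axis=0, keepdims=True)          # (1, h1)

# weight and bias updates
w_out -= dw_out;  b_out -= db_out
w_h2  -= dw_h2;   b_h2  -= db_h2
w_h1  -= dw_h1;   b_h1  -= db_h1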