
Back-propagation and forward-propagation for 2 hidden layers in neural network

My question is about forward and backward propagation for deep neural networks when the number of hidden layers is greater than 1.

I know what I have to do if I have a single hidden layer. In the single hidden layer case, if my input data X_train has n samples with d features (i.e. X_train is an (n, d) dimensional matrix and y_train is an (n, 1) dimensional vector) and I have h1 hidden units in my first hidden layer, then I compute Z_h1 = (X_train * w_h1) + b_h1 (where w_h1 is a weight matrix with random entries of shape (d, h1) and b_h1 is a bias unit of shape (h1, 1)). I use the sigmoid activation A_h1 = sigmoid(Z_h1), and both A_h1 and Z_h1 have shape (n, h1). If I have t output units, then I use a weight matrix w_out of shape (h1, t) and b_out of shape (t, 1) to get the output Z_out = (A_h1 * w_out) + b_out. From here I can get A_out = sigmoid(Z_out), which has shape (n, t). If I have a 2nd hidden layer (with h2 units) after the 1st hidden layer and before the output layer, then what steps must I add to the forward propagation, and which steps should I modify?

I also have an idea of how to tackle backpropagation in the case of a single hidden layer. For the single hidden layer example in the previous paragraph, I know that in the first backpropagation step (output layer -> hidden layer 1), I should do Step1_BP1: Err_out = A_out - y_train_onehot (here y_train_onehot is the one-hot representation of y_train, and Err_out has shape (n, t)). This is followed by Step2_BP1: delta_w_out = (A_h1)^T * Err_out and delta_b_out = sum(Err_out), where (.)^T denotes the transpose of a matrix. For the second backpropagation step (hidden layer 1 -> input layer), I do the following. Step1_BP2: sig_deriv_h1 = A_h1 * (1 - A_h1), where sig_deriv_h1 has shape (n, h1). Step2_BP2: Err_h1 = (Err_out * w_out^T) ⊙ sig_deriv_h1, i.e. the matrix product Err_out * w_out^T multiplied element-wise by sig_deriv_h1; Err_h1 has shape (n, h1). In the final step, I do Step3_BP2: delta_w_h1 = (X_train)^T * Err_h1 and delta_b_h1 = sum(Err_h1). What backpropagation steps should I add if I have a 2nd hidden layer (with h2 units) after the 1st hidden layer and before the output layer? Should I modify the backpropagation steps for the one hidden layer case that I have described here?
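For reference, the single-hidden-layer steps above can be sketched in NumPy like this (the toy data and the sigmoid helper are placeholders, and the biases are stored as row vectors so they broadcast):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data standing in for X_train / y_train_onehot
rng = np.random.default_rng(0)
n, d, h1, t = 100, 4, 16, 3
X_train = rng.normal(size=(n, d))
y_train_onehot = np.eye(t)[rng.integers(0, t, size=n)]

# parameters
w_h1, b_h1 = 0.1 * rng.normal(size=(d, h1)), np.zeros((1, h1))
w_out, b_out = 0.1 * rng.normal(size=(h1, t)), np.zeros((1, t))

# forward propagation
Z_h1 = X_train @ w_h1 + b_h1          # (n, h1)
A_h1 = sigmoid(Z_h1)                  # (n, h1)
Z_out = A_h1 @ w_out + b_out          # (n, t)
A_out = sigmoid(Z_out)                # (n, t)

# backpropagation (gradients only, no update step)
Err_out = A_out - y_train_onehot              # (n, t)
delta_w_out = A_h1.T @ Err_out                # (h1, t)
delta_b_out = Err_out.sum(axis=0)             # (t,)
sig_deriv_h1 = A_h1 * (1 - A_h1)              # (n, h1)
Err_h1 = (Err_out @ w_out.T) * sig_deriv_h1   # (n, h1)
delta_w_h1 = X_train.T @ Err_h1               # (d, h1)
delta_b_h1 = Err_h1.sum(axis=0)               # (h1,)
```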

1 Answer

● Let X be the matrix of samples, of shape (n, d), where n denotes the number of samples and d the number of features.

● Let w_h1 be the weight matrix of the first hidden layer, of shape (d, h1), and

● let b_h1 be its bias vector, of shape (1, h1).

You need the following steps for forward and backward propagations:

FORWARD PROPAGATION:

Step 1:

Z_h1 = (X • w_h1) + b_h1

Shapes: Z_h1 is (n, h1), X is (n, d), w_h1 is (d, h1), b_h1 is (1, h1).

Here, the symbol • represents matrix multiplication, and h1 denotes the number of hidden units in the first hidden layer.

Step 2:

Let Φ(·) be the activation function. We get:

a_h1 = Φ(Z_h1)

Shapes: both a_h1 and Z_h1 are (n, h1).

Step 3:

Initialise the weights and biases of the second hidden layer:

w_h2 of shape (h1, h2), and

b_h2 of shape (1, h2).

Step 4:

Z_h2 = (a_h1 • w_h2) + b_h2

Shapes: Z_h2 is (n, h2), a_h1 is (n, h1), w_h2 is (h1, h2), b_h2 is (1, h2).

Here, h2 is the number of hidden units in the second hidden layer.

Step 5:

a_h2 = Φ(Z_h2)

Shapes: both a_h2 and Z_h2 are (n, h2).

Step 6:

Initialise the weights and biases of the output layer:

w_out of shape (h2, t), and

b_out of shape (1, t).

Here, t is the number of classes.

Step 7:

Z_out = (a_h2 • w_out) + b_out

Shapes: Z_out is (n, t), a_h2 is (n, h2), w_out is (h2, t), b_out is (1, t).

Step 8:

a_out = Φ(Z_out)

Shapes: both a_out and Z_out are (n, t).
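Putting the eight steps together, a minimal NumPy sketch of the forward pass might look like this (the toy sizes, the random initialisation, and the choice of sigmoid for Φ are assumptions for illustration):

```python
import numpy as np

def phi(z):
    """Activation Φ; sigmoid is assumed here."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d, h1, h2, t = 100, 4, 16, 8, 3      # toy sizes
X = rng.normal(size=(n, d))             # samples, shape (n, d)

# Steps 1-2: input layer -> first hidden layer
w_h1, b_h1 = 0.1 * rng.normal(size=(d, h1)), np.zeros((1, h1))
Z_h1 = X @ w_h1 + b_h1                  # (n, h1)
a_h1 = phi(Z_h1)                        # (n, h1)

# Steps 3-5: first hidden layer -> second hidden layer
w_h2, b_h2 = 0.1 * rng.normal(size=(h1, h2)), np.zeros((1, h2))
Z_h2 = a_h1 @ w_h2 + b_h2               # (n, h2)
a_h2 = phi(Z_h2)                        # (n, h2)

# Steps 6-8: second hidden layer -> output layer
w_out, b_out = 0.1 * rng.normal(size=(h2, t)), np.zeros((1, t))
Z_out = a_h2 @ w_out + b_out            # (n, t)
a_out = phi(Z_out)                      # (n, t)
```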

BACKWARD PROPAGATION:

Step 1:

Construct the one-hot encoded matrix of the output classes (y_onehot), of shape (n, t).

Error_out = a_out - y_onehot

Shapes: Error_out, a_out, and y_onehot are all (n, t).

Step 2:

Δw_out = η (a_h2^T • Error_out)

Shapes: Δw_out is (h2, t), a_h2^T is (h2, n), Error_out is (n, t).

Δb_out = η ∑_{i=1}^{n} Error_out[i, :]   (sum over the n rows of Error_out)

Shapes: Δb_out is (1, t).

Here η is the learning rate.

w_out = w_out - Δw_out         (weight update)

b_out = b_out - Δb_out         (bias update)

Step 3:

Error_2 = [Error_out • w_out^T] ✴ Φ'(a_h2)

Shapes: Error_2 is (n, h2), Error_out is (n, t), w_out^T is (t, h2), Φ'(a_h2) is (n, h2).

Here, the symbol ✴ denotes element-wise (Hadamard) matrix multiplication, and Φ' denotes the derivative of the activation function. For the sigmoid, Φ'(a_h2) = a_h2 ✴ (1 - a_h2), which is why the derivative can be evaluated directly from the activations.

Step 4:

Δw_h2 = η (a_h1^T • Error_2)

Shapes: Δw_h2 is (h1, h2), a_h1^T is (h1, n), Error_2 is (n, h2).

Δb_h2 = η ∑_{i=1}^{n} Error_2[i, :]   (sum over the n rows of Error_2)

Shapes: Δb_h2 is (1, h2).

w_h2 = w_h2 - Δw_h2         (weight update)

b_h2 = b_h2 - Δb_h2         (bias update)

Step 5:

Error_3 = [Error_2 • w_h2^T] ✴ Φ'(a_h1)

Shapes: Error_3 is (n, h1), Error_2 is (n, h2), w_h2^T is (h2, h1), Φ'(a_h1) is (n, h1).

Step 6:

Δw_h1 = η (X^T • Error_3)

Shapes: Δw_h1 is (d, h1), X^T is (d, n), Error_3 is (n, h1).

Δb_h1 = η ∑_{i=1}^{n} Error_3[i, :]   (sum over the n rows of Error_3)

Shapes: Δb_h1 is (1, h1).

w_h1 = w_h1 - Δw_h1         (weight update)

b_h1 = b_h1 - Δb_h1         (bias update)
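Continuing the forward-pass sketch above, the six backward-propagation steps can be written as follows (y_onehot and the learning rate eta are placeholder assumptions; the step order mirrors the steps above, so each error term uses the weights as they stand at that point):

```python
eta = 0.01                                        # learning rate η (assumed value)
y_onehot = np.eye(t)[rng.integers(0, t, size=n)]  # placeholder one-hot targets, (n, t)

# Step 1: output error
Error_out = a_out - y_onehot                      # (n, t)

# Step 2: output-layer gradients and updates
dw_out = eta * (a_h2.T @ Error_out)               # (h2, t)
db_out = eta * Error_out.sum(axis=0, keepdims=True)  # (1, t)
w_out -= dw_out
b_out -= db_out

# Step 3: error at the second hidden layer; Φ'(a) = a * (1 - a) for the sigmoid
Error_2 = (Error_out @ w_out.T) * (a_h2 * (1 - a_h2))   # (n, h2)

# Step 4: second-hidden-layer gradients and updates
dw_h2 = eta * (a_h1.T @ Error_2)                  # (h1, h2)
db_h2 = eta * Error_2.sum(axis=0, keepdims=True)  # (1, h2)
w_h2 -= dw_h2
b_h2 -= db_h2

# Step 5: error at the first hidden layer
Error_3 = (Error_2 @ w_h2.T) * (a_h1 * (1 - a_h1))      # (n, h1)

# Step 6: first-hidden-layer gradients and updates
dw_h1 = eta * (X.T @ Error_3)                     # (d, h1)
db_h1 = eta * Error_3.sum(axis=0, keepdims=True)  # (1, h1)
w_h1 -= dw_h1
b_h1 -= db_h1
```

In practice, these forward and backward passes would be repeated for many epochs (usually over mini-batches) until the loss stops improving.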
