 

Stuck implementing simple neural network

I've been bashing my head against this brick wall for what seems like an eternity, and I just can't seem to wrap my head around it. I'm trying to implement an autoencoder using only numpy and matrix multiplication. No theano or keras tricks allowed.

I'll describe the problem and all its details. It is a bit complex at first since there are a lot of variables, but it really is quite straightforward.

What we know

1) X is an m by n matrix which is our inputs. The inputs are rows of this matrix. Each input is an n dimensional row vector, and we have m of them.

2) The number of neurons in our (single) hidden layer, which is k.

3) The activation function of our neurons (sigmoid, will be denoted as g(x)) and its derivative g'(x)

What we don't know and want to find

Overall our goal is to find 6 matrices: w1 which is n by k, b1 which is m by k, w2 which is k by n, b2 which is m by n, w3 which is n by n and b3 which is m by n.

They are initialized randomly and we find the best solution using gradient descent.

The process

The entire process looks something like this:

First we compute z1 = Xw1+b1. It is m by k and is the input to our hidden layer. We then compute h1 = g(z1), which simply applies the sigmoid function to every element of z1. Naturally it is also m by k and is the output of our hidden layer.

We then compute z2 = h1w2+b2 which is m by n and is the input to the output layer of our neural network. Then we compute h2 = g(z2) which again is naturally also m by n and is the output of our neural network.

Finally, we take this output and apply one more linear operation to it: Xhat = h2w3+b3, which is also m by n and is our final result.

Where I am stuck

The cost function I want to minimize is the mean squared error. I have already implemented it in numpy:

def cost(x, xhat):
    return (1.0/(2 * m)) * np.trace(np.dot(x-xhat,(x-xhat).T))
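
Since np.trace(np.dot(x-xhat,(x-xhat).T)) is just the sum of the squared entries of x-xhat, the same value can also be computed without building an m by m matrix. A small equivalent sketch:

def cost_direct(x, xhat): #equivalent mean squared error, written as a plain sum of squares
    return np.sum((x - xhat)**2) / (2.0 * m) #uses the same global m as cost() above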

The problem is finding the derivatives of cost with respect to w1,b1,w2,b2,w3,b3. Let's call the cost S.

After deriving myself and checking myself numerically, I have established the following facts:

1) dSdxhat = (1.0/m) * (xhat - x)

2) dSdw3 = np.dot(h2.T,dSdxhat)

3) dSdb3 = dSdxhat

4) dSdh2 = np.dot(dSdxhat, w3.T)

But I can't for the life of me figure out dSdz2. It's a brick wall.

From the chain rule, it should be that dSdz2 = dSdh2 * dh2dz2, but the dimensions don't match.

What is the formula to compute the derivative of S with respect to z2?

Edit - This is my code for the entire feed-forward operation of the autoencoder.

import numpy as np

def g(x): #sigmoid activation functions
    return 1/(1+np.exp(-x)) #same shape as x!

def gGradient(x): #gradient of sigmoid
    return g(x)*(1-g(x)) #same shape as x!

def cost(x, xhat): #mean squared error between x the data and xhat the output of the machine
    return (1.0/(2 * m)) * np.trace(np.dot(x-xhat,(x-xhat).T))

#Just small random numbers so we can test that it's working small scale
m = 5 #num of examples
n = 2 #num of features in each example
k = 2 #num of neurons in the hidden layer of the autoencoder
x = np.random.rand(m, n) #the data, shape (m, n)

w1 = np.random.rand(n, k) #weights from input layer to hidden layer, shape (n, k)
b1 = np.random.rand(m, k) #bias term from input layer to hidden layer (m, k)
z1 = np.dot(x,w1)+b1 #input to the hidden layer, shape (m, k)
h1 = g(z1) #output of the hidden layer, shape (m, k)

w2 = np.random.rand(k, n) #weights from hidden layer to output layer of the autoencoder, shape (k, n)
b2 = np.random.rand(m, n) #bias term from hidden layer to output layer of autoencoder, shape (m, n)
z2 = np.dot(h1, w2)+b2 #input to the output layer of the autoencoder, shape (m, n)
h2 = g(z2) #output of the autoencoder's output layer, shape (m, n)

w3 = np.random.rand(n, n) #weights from output layer of autoencoder to entire output of the machine, shape (n, n)
b3 = np.random.rand(m, n) #bias term from output layer of autoencoder to entire output of the machine, shape (m, n)
xhat = np.dot(h2, w3)+b3 #the output of the machine, which hopefully resembles the original data x, shape (m, n)
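
As a quick sanity check that the dimensions above line up, something like the following should pass (just a small check, not part of the model itself):

assert z1.shape == (m, k) #input to the hidden layer
assert h1.shape == (m, k) #output of the hidden layer
assert z2.shape == (m, n) #input to the output layer
assert h2.shape == (m, n) #output of the output layer
assert xhat.shape == (m, n) #final output, same shape as x
print(cost(x, xhat)) #a single non-negative number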
asked Oct 01 '16 by Oria Gruber


1 Answer

OK, here's a suggestion. In the vector case, if you have x as a vector of length n, then g(x) is also a vector of length n. However, g'(x) is not a vector, it's the Jacobian matrix, and will be of size n X n. Similarly, in the minibatch case, where X is a matrix of size m X n, g(X) is m X n but g'(X) is n X n. Try:

def gGradient(x): #gradient of sigmoid
    return np.dot(g(x).T, 1 - g(x))

@Paul is right that the bias terms should be vectors, not matrices. You should have:

b1 = np.random.rand(k) #bias term from input layer to hidden layer (k,)
b2 = np.random.rand(n) #bias term from hidden layer to output layer of autoencoder, shape (n,)
b3 = np.random.rand(n) #bias term from output layer of autoencoder to entire output of the machine, shape (n,)

Numpy's broadcasting means that you don't have to change your calculation of xhat.
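
Just to illustrate the broadcasting rule: adding a shape-(n,) vector to a shape-(m, n) matrix adds it to every row, so np.dot(h2, w3) + b3 still has shape (m, n). A tiny standalone example:

A = np.random.rand(5, 2) #pretend m = 5, n = 2
b = np.random.rand(2)    #a vector bias of shape (n,)
print((A + b).shape)     #(5, 2): b is added to every row of A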

Then (I think!) you can compute the derivatives like this:

dSdxhat = (1/float(m)) * (xhat-x)
dSdw3 = np.dot(h2.T,dSdxhat)
dSdb3 = dSdxhat.mean(axis=0)
dSdh2 = np.dot(dSdxhat, w3.T)
dSdz2 = np.dot(dSdh2, gGradient(z2))
dSdb2 = dSdz2.mean(axis=0)
dSdw2 = np.dot(h1.T,dSdz2)
dSdh1 = np.dot(dSdz2, w2.T)
dSdz1 = np.dot(dSdh1, gGradient(z1))
dSdb1 = dSdz1.mean(axis=0)
dSdw1 = np.dot(x.T,dSdz1)

Does this work for you?

Edit

I've decided that I'm not at all sure that gGradient is supposed to be a matrix. How about:

dSdxhat = (xhat-x) / m
dSdw3 = np.dot(h2.T,dSdxhat)
dSdb3 = dSdxhat.sum(axis=0)
dSdh2 = np.dot(dSdxhat, w3.T)
dSdz2 = h2 * (1-h2) * dSdh2
dSdb2 = dSdz2.sum(axis=0)
dSdw2 = np.dot(h1.T,dSdz2)
dSdh1 = np.dot(dSdz2, w2.T)
dSdz1 = h1 * (1-h1) * dSdh1
dSdb1 = dSdz1.sum(axis=0)
dSdw1 = np.dot(x.T,dSdz1)
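
If that checks out, a plain gradient-descent step using these derivatives could look something like this (the learning rate lr is just an arbitrary value for illustration):

lr = 0.1 #illustrative step size, not tuned
w1 -= lr * dSdw1
b1 -= lr * dSdb1
w2 -= lr * dSdw2
b2 -= lr * dSdb2
w3 -= lr * dSdw3
b3 -= lr * dSdb3
#then recompute z1, h1, z2, h2, xhat and the gradients, and repeat until cost(x, xhat) stops decreasing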
answered Oct 24 '22 by wildwilhelm