I am writing a simple implementation of an MLP with a single output unit (binary classification). I need it for teaching purposes, so I can't use an existing implementation :(
I managed to create a working dummy model and implement a training function, but the MLP does not converge. In fact, the gradient for the output unit stays large over the epochs, so its weights grow toward infinity.
My implementation:
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
X = np.loadtxt('synthetic.txt')
t = X[:, 2].astype(int)
X = X[:, 0:2]
# Sigmoid activation function for output unit
def logistic(x):
    return 1 / (1 + np.exp(-x))
# derivative of the tanh activation function for hidden units
def tanh_deriv(x):
    return 1 - np.tanh(x) * np.tanh(x)
input_num = 2 # number of units in the input layer
hidden_num = 2 # number of units in the hidden layer
# initialize weights with random values:
weights_hidden = np.array((2 * np.random.random( (input_num + 1, hidden_num + 1) ) - 1 ) * 0.25)
weights_out = np.array((2 * np.random.random( hidden_num + 1 ) - 1 ) * 0.25)
def predict(x):
    global input_num
    global hidden_num
    global weights_hidden
    global weights_out
    x = np.append(x.astype(float), 1.0)  # input to the hidden layer: features + bias term
    a = x.dot(weights_hidden)            # activations of the hidden layer
    z = np.tanh(a)                       # output of the hidden layer
    q = logistic(z.dot(weights_out))     # output of the decision layer
    if q >= 0.5:
        return 1
    return 0
def train(X, t, learning_rate=0.2, epochs=50):
    global input_num
    global hidden_num
    global weights_hidden
    global weights_out
    weights_hidden = np.array((2 * np.random.random((input_num + 1, hidden_num + 1)) - 1) * 0.25)
    weights_out = np.array((2 * np.random.random(hidden_num + 1) - 1) * 0.25)
    for epoch in range(epochs):
        gradient_out = 0.0       # gradients for output and hidden layers
        gradient_hidden = []
        for i in range(X.shape[0]):
            # forward propagation
            x = np.array(X[i])
            x = np.append(x.astype(float), 1.0)  # input to the hidden layer: features + bias term
            a = x.dot(weights_hidden)  # activations of the hidden layer
            z = np.tanh(a)             # output of the hidden layer
            q = z.dot(weights_out)     # activation of the output (decision) layer
            y = logistic(q)            # output of the decision layer
            # backpropagation
            delta_hidden_s = []        # delta and gradient for a single training sample (hidden layer)
            gradient_hidden_s = []
            delta_out_s = t[i] - y     # delta and gradient for a single training sample (output layer)
            gradient_out_s = delta_out_s * z
            for j in range(hidden_num + 1):
                delta_hidden_s.append(tanh_deriv(a[j]) * (weights_out[j] * delta_out_s))
                gradient_hidden_s.append(delta_hidden_s[j] * x)
            gradient_out = gradient_out + gradient_out_s  # accumulate gradients over the training set
            gradient_hidden = gradient_hidden + gradient_hidden_s
        print("\n#", epoch, "Gradient out: ", gradient_out)
        print("\n Weights out: ", weights_out)
        # Now update the weights
        weights_out = weights_out - learning_rate * gradient_out
        for j in range(hidden_num + 1):
            weights_hidden.T[j] = weights_hidden.T[j] - learning_rate * gradient_hidden[j]
train(X, t, 0.2, 50)
And here is the evolution of the gradient and weights for the output unit over the epochs:
0 Gradient out: [ 11.07640724 -7.20309009 0.24776626]
Weights out: [-0.15397237 0.22232593 0.03162811]
1 Gradient out: [ 23.68791197 -19.6688382 -1.75324703]
Weights out: [-2.36925382 1.66294395 -0.01792515]
2 Gradient out: [ 79.08612305 -65.76066015 -7.70115262]
Weights out: [-7.10683621 5.59671159 0.33272426]
3 Gradient out: [ 99.59798656 -93.90973727 -21.45674943]
Weights out: [-22.92406082 18.74884362 1.87295478]
...
49 Gradient out: [ 107.89975864 -105.8654327 -104.69591522]
Weights out: [-1003.67912726 976.87213404 922.38862049]
I tried different datasets and various numbers of hidden units. I also tried updating the weights with addition instead of subtraction... Nothing helps...
Could somebody tell me what might be wrong? Thanks in advance.
I do not believe you should use the sum-of-squares error function for binary classification. Instead, you should use the cross-entropy error function, which is essentially a negative log-likelihood. This way the error becomes much more expensive the farther your prediction is from the correct answer. Please read the section "Network Training", pp. 235 ff., in "Pattern Recognition and Machine Learning" by Christopher Bishop; it gives a proper overview of how to do supervised learning in a FFNN.
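To illustrate, here is a minimal sketch (function names are mine, not from the question's code) of the cross-entropy error for a single sigmoid output unit, and the gradient it induces on the output weights. A useful property is that with a sigmoid output the derivative of the cross-entropy error with respect to the pre-activation simplifies to (y - t) — note the sign, since the question's code computes (t - y) and then *subtracts* the gradient:

```python
import numpy as np

def logistic(x):
    return 1 / (1 + np.exp(-x))

def cross_entropy(y, t, eps=1e-12):
    """Cross-entropy error for prediction y = logistic(q) and 0/1 target t."""
    y = np.clip(y, eps, 1 - eps)  # avoid log(0)
    return -(t * np.log(y) + (1 - t) * np.log(1 - y))

def grad_out(y, t, z):
    """Gradient of the cross-entropy error w.r.t. the output weights.

    With a sigmoid output, dE/dq = y - t, so the gradient w.r.t. the
    weights feeding the output unit is (y - t) * z, where z is the
    hidden-layer output. Subtracting learning_rate * grad_out(...) then
    moves the weights downhill.
    """
    return (y - t) * z
```

With `(y - t)` the usual update `w -= learning_rate * grad` descends the error surface; computing `(t - y)` and still subtracting ascends it, which matches the diverging weights in the question.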
The bias units are extremely important: they make it possible for the transfer function to shift along the x-axis, while the weights change the steepness of the transfer function's curve. Note this difference between biases and weights, as it gives a good understanding of why both need to be present in a FFNN.
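A quick sketch of that distinction for a single sigmoid unit (the specific values of `w` and `b` here are arbitrary, chosen only for illustration): the weight sets the slope of the curve, and the bias moves its midpoint, i.e. the input at which the output crosses 0.5:

```python
import numpy as np

def logistic(x):
    return 1 / (1 + np.exp(-x))

w, b = 2.0, -4.0  # w scales the slope; b shifts the curve along x

# Without a bias the sigmoid is centred at x = 0. With the bias the
# midpoint moves to x = -b/w = 2, where the output crosses 0.5.
x = np.array([0.0, 2.0, 4.0])
out = logistic(w * x + b)
```

Here `out` is below 0.5 at x = 0, exactly 0.5 at x = 2, and above 0.5 at x = 4, which a bias-free unit (always centred at the origin) could never produce.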