Debugging a Neural Network

TLDR

I have been trying to fit a simple neural network on MNIST, and it works for a small debugging setup, but when I bring it over to a subset of MNIST, it trains super fast and the gradient is close to 0 very quickly, but then it outputs the same value for any given input and the final cost is quite high. I had been trying to purposefully overfit to make sure it is in fact working but it will not do so on MNIST suggesting a deep problem in the setup. I have checked my backpropagation implementation using gradient checking and it seems to match up, so not sure where the error lies, or what to work on now!

Many thanks for any help you can offer, I've been struggling to fix this!

Explanation

I have been trying to make a neural network in Numpy, based on this explanation: http://ufldl.stanford.edu/wiki/index.php/Neural_Networks http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm

Backpropagation seems to match gradient checking:

Click to copy

Backpropagation: [ 0.01168585,  0.06629858, -0.00112408, -0.00642625, -0.01339408,
   -0.07580145,  0.00285868,  0.01628148,  0.00365659,  0.0208475 ,
    0.11194151,  0.16696139,  0.10999967,  0.13873069,  0.13049299,
   -0.09012582, -0.1344335 , -0.08857648, -0.11168955, -0.10506167]
Gradient Checking: [-0.01168585 -0.06629858  0.00112408  0.00642625  0.01339408  
    0.07580145 -0.00285868 -0.01628148 -0.00365659 -0.0208475  
   -0.11194151 -0.16696139 -0.10999967 -0.13873069 -0.13049299  
    0.09012582  0.1344335   0.08857648  0.11168955  0.10506167]

And when I train on this simple debug setup:

Click to copy

a is a neural net w/ 2 inputs -> 5 hidden -> 2 outputs, and learning rate 0.5
a.gradDesc(np.array([[0.1,0.9],[0.2,0.8]]),np.array([[0,1],[0,1]]))
ie. x1 = [0.1, 0.9] and y1 = [0,1]

I get these lovely training curves error on iterations gradient on iterations

Admittedly this is clearly a dumbed down, very easy function to fit. However as soon as I bring it over to MNIST, with this setup:

Click to copy

    # Number of input, hidden and ouput nodes
    # Input = 28 x 28 pixels
    input_nodes=784
    # Arbitrary number of hidden nodes, experiment to improve
    hidden_nodes=200
    # Output = one of the digits [0,1,2,3,4,5,6,7,8,9]
    output_nodes=10

    # Learning rate
    learning_rate=0.4

    # Regularisation parameter
    lambd=0.0

With this setup run on the code below, for 100 iterations, it does seem to train at first then just "flat lines" quite quickly and doesnt achieve a very good model:

Click to copy

Initial ===== Cost (unregularised):  2.09203670985  /// Cost (regularised):      2.09203670985    Mean Gradient:  0.0321241229793
Iteration  100  Cost (unregularised):  0.980999805477   /// Cost (regularised):  0.980999805477    Mean Gradient:  -5.29639499854e-09
TRAINED IN  26.45932364463806

This then gives really poor test accuracy and predicts the same output, even when tested with all inputs being 0.1 or all 0.9 I just get the same output (although precisely which number it outputs varies depending on initial random weights):

Click to copy

Test accuracy:  8.92
Targets    2  2  1  7  2  2  0  2  3  
Hypothesis 5  5  5  5  5  5  5  5  5

And the curves for the MNIST Training: enter image description here

Code dump:

Click to copy

# Import dependencies
import numpy as np
import time
import csv
import matplotlib.pyplot
import random
import math

# Read in training data
with open('MNIST/mnist_train_100.csv') as file:
    train_data=np.array([list(map(int,line.strip().split(','))) for line in file.readlines()])


# In[197]:

# Plot a sample of training data to visualise
displayData(train_data[:,1:], 25)


# In[198]:

# Read in test data
with open('MNIST/mnist_test.csv') as file:
    test_data=np.array([list(map(int,line.strip().split(','))) for line in file.readlines()])

# Main neural network class
class neuralNetwork:
    # Define the architecture
    def __init__(self, i, h, o, lr, lda):
        # Number of nodes in each layer
        self.i=i
        self.h=h
        self.o=o
        # Learning rate
        self.lr=lr
        # Lambda for regularisation
        self.lda=lda
        
        # Randomly initialise the parameters, input-> hidden and hidden-> output
        self.ih=np.random.normal(0.0,pow(self.h,-0.5),(self.h,self.i))
        self.ho=np.random.normal(0.0,pow(self.o,-0.5),(self.o,self.h))
   
    def predict(self, X):
        # GET HYPOTHESIS ESTIMATES/ OUTPUTS
        # Add bias node x(0)=1 for all training examples, X is now m x n+1
        # Then compute activation to hidden node
        z2=np.dot(X,self.ih.T) + 1
        #print(a1.shape)
        a2=sigmoid(z2)
        #print(ha)
        # Add bias node h(0)=1 for all training examples, H is now m x h+1
        # Then compute activation to output node
        z3=np.dot(a2,self.ho.T) + 1
        h=sigmoid(z3)
        outputs=np.argmax(h.T,axis=0)
        
        return outputs

    def backprop (self, X, y):
        try:
            m = X.shape[0]
        except:
            m=1

        # GET HYPOTHESIS ESTIMATES/ OUTPUTS
        # Add bias node x(0)=1 for all training examples, X is now m x n+1
        # Then compute activation to hidden node
        z2=np.dot(X,self.ih.T)
        #print(a1.shape)
        a2=sigmoid(z2)
        #print(ha)
        # Add bias node h(0)=1 for all training examples, H is now m x h+1
        # Then compute activation to output node
        z3=np.dot(a2,self.ho.T)
        h=sigmoid(z3)

        # Compute error/ cost for this setup (unregularised and regularise)
        costReg=self.costFunc(h,y)
        costUn=self.costFuncReg(h,y)

        # Output error term
        d3=-(y-h)*sigmoidGradient(z3)

        # Hidden error term
        d2=np.dot(d3,self.ho)*sigmoidGradient(z2)

        # Partial derivatives for weights
        D2=np.dot(d3.T,a2)
        D1=np.dot(d2.T,X)

        # Partial derivatives of theta with regularisation
        T2Grad=(D2/m)+(self.lda/m)*(self.ho)
        T1Grad=(D1/m)+(self.lda/m)*(self.ih)

        # Update weights
        # Hidden layer (weights 1)
        self.ih-=self.lr*(((D1)/m) + (self.lda/m)*self.ih)
        # Output layer (weights 2)
        self.ho-=self.lr*(((D2)/m) + (self.lda/m)*self.ho)

        # Unroll gradients to one long vector
        grad=np.concatenate(((T1Grad).ravel(),(T2Grad).ravel()))

        return costReg, costUn, grad
 
    def backpropIter (self, X, y):
        try:
            m = X.shape[0]
        except:
            m=1
        
        # GET HYPOTHESIS ESTIMATES/ OUTPUTS
        # Add bias node x(0)=1 for all training examples, X is now m x n+1
        # Then compute activation to hidden node
        z2=np.dot(X,self.ih.T)
        #print(a1.shape)
        a2=sigmoid(z2)
        #print(ha)
        # Add bias node h(0)=1 for all training examples, H is now m x h+1
        # Then compute activation to output node
        z3=np.dot(a2,self.ho.T)
        h=sigmoid(z3)
        
        # Compute error/ cost for this setup (unregularised and regularise)
        costUn=self.costFunc(h,y)
        costReg=self.costFuncReg(h,y)
        
        gradW1=np.zeros(self.ih.shape)
        gradW2=np.zeros(self.ho.shape)
        for i in range(m):
            delta3 = -(y[i,:]-h[i,:])*sigmoidGradient(z3[i,:])
            delta2 = np.dot(self.ho.T,delta3)*sigmoidGradient(z2[i,:])
            
            gradW2= gradW2 + np.outer(delta3,a2[i,:])
            gradW1 = gradW1 + np.outer(delta2,X[i,:])

        # Update weights
        # Hidden layer (weights 1)
        #self.ih-=self.lr*(((gradW1)/m) + (self.lda/m)*self.ih)
        # Output layer (weights 2)
        #self.ho-=self.lr*(((gradW2)/m) + (self.lda/m)*self.ho)

        # Unroll gradients to one long vector
        grad=np.concatenate(((gradW1).ravel(),(gradW2).ravel()))
        
        return costUn, costReg, grad
    
    def gradDesc(self, X, y):
        # Backpropagate to get updates
        cost,costreg,grad=self.backpropIter(X,y)

        # Unroll parameters
        deltaW1=np.reshape(grad[0:self.h*self.i],(self.h,self.i))
        deltaW2=np.reshape(grad[self.h*self.i:],(self.o,self.h))
        
        # m = no. training examples
        m=X.shape[0]
        #print (self.ih)
        self.ih -= self.lr * ((deltaW1))#/m) + (self.lda * self.ih)) 
        self.ho -= self.lr * ((deltaW2))#/m) + (self.lda * self.ho)) 
        #print(deltaW1)
        #print(self.ih)
        return cost,costreg,grad
        
    
    # Gradient checking to compute the gradient numerically to debug backpropagation
    def gradCheck(self, X, y):
        # Unroll theta
        theta=np.concatenate(((self.ih).ravel(),(self.ho).ravel()))
        # perturb will add and subtract epsilon, numgrad will store answers
        perturb=np.zeros(len(theta))
        numgrad=np.zeros(len(theta))
        # epsilon, e is a small number
        e = 0.00001
        # Loop over all theta
        for i in range(len(theta)):
            # Perturb is zeros with one index being e
            perturb[i]=e
            loss1=self.costFuncGradientCheck(theta-perturb, X, y)
            loss2=self.costFuncGradientCheck(theta+perturb, X, y)
            # Compute numerical gradient and update vectors
            numgrad[i]=(loss1-loss2)/(2*e)
            perturb[i]=0
        return numgrad
        
    def costFuncGradientCheck(self,theta,X,y):
        T1=np.reshape(theta[0:self.h*self.i],(self.h,self.i))
        T2=np.reshape(theta[self.h*self.i:],(self.o,self.h))
        m=X.shape[0]
        # GET HYPOTHESIS ESTIMATES/ OUTPUTS
        # Compute activation to hidden node
        z2=np.dot(X,T1.T)
        a2=sigmoid(z2)
        # Compute activation to output node
        z3=np.dot(a2,T2.T)
        h=sigmoid(z3)
        
        cost=self.costFunc(h, y)
        return cost #+ ((self.lda/2)*(np.sum(pow(T1,2)) + np.sum(pow(T2,2))))
    
    def costFunc(self, h, y):
        m=h.shape[0]
        return np.sum(pow((h-y),2))/m
        
    def costFuncReg(self, h, y):
        cost=self.costFunc(h, y)
        return cost #+ ((self.lda/2)*(np.sum(pow(self.ih,2)) + np.sum(pow(self.ho,2))))
        
# Helper functions to compute sigmoid and gradient for an input number or matrix
def sigmoid(Z):
    return np.divide(1,np.add(1,np.exp(-Z)))
def sigmoidGradient(Z):
    return sigmoid(Z)*(1-sigmoid(Z))

# Pre=processing helper functions
# Normalise data to 0.1-1 as 0 inputs kills the weights and changes
def scaleDataVec(data):
    return (np.asfarray(data[1:]) / 255.0 * 0.99) + 0.1

def scaleData(data):
    return (np.asfarray(data[:,1:]) / 255.0 * 0.99) + 0.1

# DISPLAY DATA
# plot_data will be what to plot, num_ex must be a square number of how many examples to plot, random examples will then be plotted
def displayData(plot_data, num_ex, rand=1):
    if rand==0:
        data=plot_data
    else:
        rand_indexes=random.sample(range(plot_data.shape[0]),num_ex)
        data=plot_data[rand_indexes,:]
    # Useful variables, m= no. train ex, n= no. features
    m=data.shape[0]
    n=data.shape[1]
    # Shape for one example
    example_width=math.ceil(math.sqrt(n))
    example_height=math.ceil(n/example_width)
    # No. of items to display
    display_rows=math.floor(math.sqrt(m))
    display_cols=math.ceil(m/display_rows)
    # Padding between images
    pad=1
    # Setup blank display
    display_array = -np.ones((pad + display_rows * (example_height + pad), (pad + display_cols * (example_width + pad))))
    curr_ex=0
    for i in range(1,display_rows+1):
        for j in range(1,display_cols+1):
            if curr_ex>m:
                break
            # Max value of this patch
            max_val=max(abs(data[curr_ex, :]))
            display_array[pad + (j-1) * (example_height + pad) : j*(example_height+1), pad + (i-1) * (example_width + pad) :                           i*(example_width+1)] = data[curr_ex, :].reshape(example_height, example_width)/max_val
            curr_ex+=1

    matplotlib.pyplot.imshow(display_array, cmap='Greys', interpolation='None')


# In[312]:

a=neuralNetwork(2,5,2,0.5,0.0)
print(a.backpropIter(np.array([[0.1,0.9],[0.2,0.8]]),np.array([[0,1],[0,1]])))
print(a.gradCheck(np.array([[0.1,0.9],[0.2,0.8]]),np.array([[0,1],[0,1]])))
D=[]
C=[]
for i in range(100):
    c,b,d=a.gradDesc(np.array([[0.1,0.9],[0.2,0.8]]),np.array([[0,1],[0,1]]))
    C.append(c)
    D.append(np.mean(d))
    #print(c)
    
print(a.predict(np.array([[0.1,0.9]])))
# Debugging plot
matplotlib.pyplot.figure()
matplotlib.pyplot.plot(C)
matplotlib.pyplot.ylabel("Error")
matplotlib.pyplot.xlabel("Iterations")
matplotlib.pyplot.figure()
matplotlib.pyplot.plot(D)
matplotlib.pyplot.ylabel("Gradient")
matplotlib.pyplot.xlabel("Iterations")
#print(J)


# In[313]:

# Class instance

# Number of input, hidden and ouput nodes
# Input = 28 x 28 pixels
input_nodes=784
# Arbitrary number of hidden nodes, experiment to improve
hidden_nodes=200
# Output = one of the digits [0,1,2,3,4,5,6,7,8,9]
output_nodes=10

# Learning rate
learning_rate=0.4

# Regularisation parameter
lambd=0.0

# Create instance of Nnet class
nn=neuralNetwork(input_nodes,hidden_nodes,output_nodes,learning_rate,lambd)


# In[314]:

time1=time.time()
# Scale inputs
inputs=scaleData(train_data)
# 0.01-0.99 range as the sigmoid function can't reach 0 or 1, 0.01 for all except 0.99 for target
targets=(np.identity(output_nodes)*0.98)[train_data[:,0],:]+0.01
J=[]
JR=[]
Grad=[]
iterations=100
for i in range(iterations):
    j,jr,grad=nn.gradDesc(inputs, targets)
    grad=np.mean(grad)
    if i == 0:
        print("Initial ===== Cost (unregularised): ", j, "\t///", "Cost (regularised): ",jr,"   Mean Gradient: ",grad)
    print("\r", end="")
    print("Iteration ", i+1, "\tCost (unregularised): ", j, "\t///", "Cost (regularised): ", jr,"   Mean Gradient: ",grad,end="")
    J.append(j)
    JR.append(jr)
    Grad.append(grad)
time2 = time.time()
print ("\nTRAINED IN ",time2-time1)


# In[315]:

# Debugging plot
matplotlib.pyplot.figure()
matplotlib.pyplot.plot(J)
matplotlib.pyplot.plot(JR)
matplotlib.pyplot.ylabel("Error")
matplotlib.pyplot.xlabel("Iterations")
matplotlib.pyplot.figure()
matplotlib.pyplot.plot(Grad)
matplotlib.pyplot.ylabel("Gradient")
matplotlib.pyplot.xlabel("Iterations")
#print(J)


# In[316]:

# Scale inputs
inputs=scaleData(test_data)
# 0.01-0.99 range as the sigmoid function can't reach 0 or 1, 0.01 for all except 0.99 for target
targets=test_data[:,0]
h=nn.predict(inputs)
score=[]
targ=[]
hyp=[]
for i,line in enumerate(targets):
    if line == h[i]:
        score.append(1)
    else:
        score.append(0)
    hyp.append(h[i])
    targ.append(line)
print("Test accuracy: ", sum(score)/len(score)*100)
indexes=random.sample(range(len(hyp)),9)
print("Targets    ",end="")
for j in indexes:
    print (targ[j]," ",end="")
print("\nHypothesis ",end="")
for j in indexes:
    print (hyp[j]," ",end="")
displayData(test_data[indexes, 1:], 9, rand=0)


# In[277]:

nn.predict(0.9*np.ones((784,)))

Edit 1

Suggested to use different learning rates, but unfortunately, they all come out with similar results, here are the plots for 30 iterations, using the MNIST 100 subset:

enter image description here

Concretely, here are the figures that they start and end with:

Click to copy

    Initial ===== Cost (unregularised):  4.07208963507  /// Cost (regularised):  4.07208963507    Mean Gradient:  0.0540251381858
    Iteration  50   Cost (unregularised):  0.613310215166   /// Cost (regularised):  0.613310215166    Mean Gradient:  -0.000133981500849Initial ===== Cost (unregularised):  5.67535252616     /// Cost (regularised):  5.67535252616    Mean Gradient:  0.0644797515914
    Iteration  50   Cost (unregularised):  0.381080434935   /// Cost (regularised):  0.381080434935    Mean Gradient:  0.000427866902699Initial ===== Cost (unregularised):  3.54658422176  /// Cost (regularised):  3.54658422176    Mean Gradient:  0.0672211732868
    Iteration  50   Cost (unregularised):  0.981    /// Cost (regularised):  0.981    Mean Gradient:  2.34515341943e-20Initial ===== Cost (unregularised):  4.05269658215   /// Cost (regularised):  4.05269658215    Mean Gradient:  0.0469666696193
    Iteration  50   Cost (unregularised):  0.980999999999   /// Cost (regularised):  0.980999999999    Mean Gradient:  -1.0582706063e-14Initial ===== Cost (unregularised):  2.40881492228  /// Cost (regularised):  2.40881492228    Mean Gradient:  0.0516056901574
    Iteration  50   Cost (unregularised):  1.74539997258    /// Cost (regularised):  1.74539997258    Mean Gradient:  1.01955789614e-09Initial ===== Cost (unregularised):  2.58498876008   /// Cost (regularised):  2.58498876008    Mean Gradient:  0.0388768685257
    Iteration  3    Cost (unregularised):  1.72520399313    /// Cost (regularised):  1.72520399313    Mean Gradient:  0.0134040908157
    Iteration  50   Cost (unregularised):  0.981    /// Cost (regularised):  0.981    Mean Gradient:  -4.49319474346e-43Initial ===== Cost (unregularised):  4.40141352357  /// Cost (regularised):  4.40141352357    Mean Gradient:  0.0689167742968
    Iteration  50   Cost (unregularised):  0.981    /// Cost (regularised):  0.981    Mean Gradient:  -1.01563966458e-22

A learning rate of 0.01, quite low, has the best outcome, but exploring learning rates in this region, I only came out with 30-40% accuracy, a big improvement on the 8% or even 0% that I had seen previously, but not really what it should be achieving!

Edit 2

I've now finished and added a backpropagation function optimized for matrices rather than the iterative formula, and so now I can run it on large epochs/ iterations without painfully slow. So the "backprop" function of the class matches with gradient check (in fact it is 1/2 the size but I think that is a problem in gradient check, so we'll leave that bc it should not matter proportionally and I have tried with adding in divisions to solve this). With large numbers of epochs I achieved a much better accuracy, but still there seems to be a problem, as when I have previously programmed a slightly different style of simple 3 layer neural network a part of a book, on the same dataset csvs, I get a much better training result. Here are some plots and data for large epochs.

enter image description here

Looks good but, we still have a pretty poor test set accuracy, and this is for 2,500 runs through the dataset, should be getting a good result with much less!

Click to copy

    Test accuracy:  61.150000000000006
    Targets    6  9  8  2  2  2  4  3  8  
    Hypothesis 6  9  8  4  7  1  4  3  8

Edit 3, what dataset?

http://makeyourownneuralnetwork.blogspot.co.uk/2015/03/the-mnist-dataset-of-handwitten-digits.html?m=1

Used train.csv and test.csv to try with more data and no better just takes longer so been using the subset train_100 and test_10 while I debug.

Edit 4

Seems to learn something after a very large number of epochs (like 14,000), as the whole dataset is used in the backprop function (not backpropiter) each loop is effectively an epoch, and with a ridiculous amount of epochs on the subset of 100 train and 10 test samples, the test accuracy is quite good. However with this small a sample this could easily be due to just chance and even then it's only 70% percent not what you'd be aiming for even on the small dataset. But it does show that it seems to be learning, I am trying parameters very extensively to rule that out.

653

asked Feb 09 '17 15:02

olliejday

1 Answers

Solved

I solved my neural network. A brief description follows in case it helps anyone else. Thanks to all those that helped with suggestions. Basically, I had implemented it with a fully matrix approach ie. the backpropagation uses all examples each time. I later tried implementing it as a vector approach ie. backpropagation with each example. This was when I realised that the matrix approach doesn't update the parameters each example, so one run through this way is NOT the same as one run through each example in turn, effectively the whole training set is backpropagated as one example. Hence, my matrix implementation does work, but after many iterations, which then ends up taking longer than the vector approach anyway! Have opened a new question to learn more about this specific part but there we go, it just needed a lot of iterations with the matrix approach or a more gradual example by example approach.

159

answered Sep 30 '22 15:09

olliejday

Related questions
                            
                                Calling C function from Python (No Return Value)
                            
                                IntelliJ-IDEA's Python type checker - incorrect expected type?
                            
                                What does it mean for statically linking and dynamically linking in Python?
                            
                                "diff -u -B -w" in python?
                            
                                When should a Task be used instead of a coroutine?
                            
                                User defined generic types and collections.abc
                            
                                FutureWarning when unstacking a pandas (v 0.17) dataframe
                            
                                How to enable tab and arrow keys using python win32gui
                            
                                In Django loaddata it throws errors for json format but work properly for yaml format, why?
                            
                                Flask not releasing memory
                            
                                Plotting with scientific axis, changing the number of significant figures
                            
                                Algorithm to exchange the roles of two randomly chosen nodes from a tree moving pointers
                            
                                TensorFlow: Performing this loss computation
                            
                                Django Proxy Field
                            
                                Can we use serializer_class attribute with APIView(django rest framework)?
                            
                                How to save plots from multiple python scripts using an interactive C# process command?
                            
                                Python scraping of javascript web pages fails for https pages only
                            
                                Providing SSL Connections in Python using PKCS#11
                            
                                Efficient way to set elements to zero where mask is True on scipy sparse matrix
                            
                                Pandas uses substantially more memory for storage than asked for

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Debugging a Neural Network

Tags:

python

machine-learning

neural-network

backpropagation

mnist

TLDR

Explanation

Code dump:

Edit 1

Edit 2

Edit 3, what dataset?

Edit 4

olliejday

People also ask

1 Answers

Solved

olliejday

Recent Activity

Donate For Us