Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's wrong with my implementation of Neural Networks?

Tags:

enter image description hereI want to plot the learning error curve of a neural net with respect to the number of training examples. Here is the code :

import sklearn import numpy as np from sklearn.model_selection import learning_curve import matplotlib.pyplot as plt from sklearn import neural_network from sklearn import cross_validation  myList=[] myList2=[] w=[]  dataset=np.loadtxt("data", delimiter=",") X=dataset[:, 0:6] Y=dataset[:,6] clf=sklearn.neural_network.MLPClassifier(hidden_layer_sizes=(2,3),activation='tanh')  # split the data between training and testing X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(X, Y, test_size=0.25, random_state=33)  # begin with few training datas X_eff=X_train[0:int(len(X_train)/150), : ] Y_eff=Y_train[0:int(len(Y_train)/150)]  k=int(len(X_train)/150)-1   for m in range (140) :       print (m)      w.append(k)      # train the model and store the training error     A=clf.fit(X_eff,Y_eff)     myList.append(1-A.score(X_eff,Y_eff))        # compute the testing error     myList2.append(1-A.score(X_test,Y_test))      # add some more training datas     X_eff=np.vstack((X_eff,X_train[k+1:k+101,:]))     Y_eff=np.hstack((Y_eff,Y_train[k+1:k+101]))     k=k+100  plt.figure(figsize=(8, 8)) plt.subplots_adjust() plt.title("Erreur d'entrainement et de test") plt.plot(w,myList,label="training error") plt.plot(w,myList2,label="test error") plt.legend() plt.show() 

However, I get a very strange result, with curves fluctuating, the training error very close to the testing error which does not appear to be normal. Where is the mistake? I can't understand why there are so many ups and downs and why the training error does not increase, as it would be expected to.Any help would be appreciated !

EDIT : the dataset I am using is https://archive.ics.uci.edu/ml/datasets/Chess+%28King-Rook+vs.+King%29 where I got rid of the classes having less than 1000 instances. I manually re-encoded the litteral data.

like image 363
MysteryGuy Avatar asked Aug 28 '17 12:08

MysteryGuy


People also ask

What are the problems with neural networks?

Neural networks usually require much more data than traditional machine learning algorithms, as in at least thousands if not millions of labeled samples. This isn't an easy problem to deal with and many machine learning problems can be solved well with less data if you use other algorithms.

How hard is it to implement a neural network?

Training deep learning neural networks is very challenging. The best general algorithm known for solving this problem is stochastic gradient descent, where model weights are updated each iteration using the backpropagation of error algorithm. Optimization in general is an extremely difficult task.

What should I do when my neural network does not learn?

First, build a small network with a single hidden layer and verify that it works correctly. Then incrementally add additional model complexity, and verify that each of those works as well. Too few neurons in a layer can restrict the representation that the network learns, causing under-fitting.

Why is my machine learning model not learning?

If you have too many classes, you could try to reduce the number by grouping them first, using a clustering algorithm. There is no data leakage from the training set into the test set. The dataset does not have noisy/empty attributes, too many missing values, or too many outliers.


2 Answers

I think that the reason you're seeing this kind of curve is that the performance metric you are measuring is different from the performance metric that you are optimizing.

Optimization metric

The neural network minimizes a loss function, and in the case of tanh activiations, I assume you are using a modified version of the cross entropy loss. If you were to plot the loss over time, you would see a more monotonically decreasing error function like you expect. (Not actually monotonic because neural networks are non-convex, but that's beside the point.)

Performance metric

The performance metric that you are measuring is the percent accuracy, which is different from the loss. Why are these different? The loss function tells us how much error we have in a differentiable way (which is important for fast optimization methods). The accuracy metric tells us how well we predict, which is useful for application of the neural network.

Putting it together

Because you are plotting the performance of a related metric, you can expect that the plot will look similar to that of your optimized metric. However because they are not the same, you may be introducing some unaccounted-for variance in your plot (as evidenced by the plot you posted).

There are a couple of ways to fix this.

  1. Plot the loss instead of the accuracy. This doesn't actually fix your problem if you actually need the accuracy plot, but it will give you much more smooth curves.
  2. Plot an average over multiple runs. Save the accuracy plots over 20 independent runs of your algorithm (as in train the network 20 times), then average them together and plot this. That will greatly reduce the variance.

TL;DR

Don't expect the accuracy plot to always be smooth and monotonically decreasing, it won't be.


After question edit:

Now that you've added your dataset, I see a few other things that may be causing the issues that you're seeing.

Information in magnitude

The dataset defines the rank and file (row and column) of several chess pieces. These are input as an integer from 1 to 6. However is 2 really 1 better than 1? Is 6 really 4 better than 2? I don't think this is the case in terms of chess position.

Imagine I am building a classifier that takes money as an input. Is there some amount of information being portrayed by the magnitude of my values? Yes, $1 is quite different from $100; and we can tell that there is a relationship based on the magnitude.

For a chess game, does row 1 mean something different than row 8? Not at all, in fact these dimensions are symmetrical! Using a bias unit in your network can help account for the symmetry by "rescaling" your inputs to be effectively from [-3, 4] which is now centered(ish) around 0.

Solutions

I think, however, you would get the most mileage out of tile-coding or one-hot encoding each of your features. Don't allow the network to rely on the information contained in the magnitude of each feature, as that may be causing the network to work its way into bad local optima.

like image 90
Andnp Avatar answered Sep 28 '22 04:09

Andnp


In addition to the previous answers you should also have in mind, that you might have to tweak the learning rate (by setting learning_rate = value in the initializer) of the network. If you choose the rate to big, you will jump from on local minimum to another or circle around these points, but won't actually converge (see the image below, taken from here).

Furthermore, please also plot the loss and not just the accuracy of your network. This will give you better insights about it.

Also, keep in mind, that you have to use a lot of training and test data to get a more or less "smooth" curve, or even a representative curve; if you are using just a few (maybe a few hundred) data points, the resulting metrics will not actually be very accurate, as they contain a lot of stochastic things. To solve this error you should not train the network with the same examples every time, but rather change the orders of your training data, and maybe split it up on different mini batches. I am very confident, that you can solve or even reduce your problem by trying to mind these aspects and to implement them.

Depending on your kind of problem, you should change the activation function to something different than the tanh function. Performing a classification, a OneHotEncoder might also be useful (if your data is not already one hot encoded); the sklearn framework is offering an implementation of this, too.

like image 35
zimmerrol Avatar answered Sep 28 '22 04:09

zimmerrol