I am playing with an ANN that is part of the Udacity Deep Learning course.
I have an assignment that involves adding regularization to a network with one hidden ReLU layer using L2 loss. I wonder how to introduce it properly so that ALL weights are penalized, not only the weights of the output layer.
The code for the network without regularization is at the bottom of the post (the code that actually runs the training is outside the scope of the question).
The obvious way of introducing L2 is to replace the loss calculation with something like this (if beta is 0.01):

    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(out_layer, tf_train_labels)
        + 0.01 * tf.nn.l2_loss(out_weights))

But in that case it only takes into account the values of the output layer's weights. I am not sure how to properly penalize the weights that come INTO the hidden ReLU layer. Is that needed at all, or will penalizing the output layer somehow keep the hidden weights in check as well?

    #some importing
    from __future__ import print_function
    import numpy as np
    import tensorflow as tf
    from six.moves import cPickle as pickle
    from six.moves import range

    #loading data
    pickle_file = '/home/maxkhk/Documents/Udacity/DeepLearningCourse/SourceCode/tensorflow/examples/udacity/notMNIST.pickle'

    with open(pickle_file, 'rb') as f:
      save = pickle.load(f)
      train_dataset = save['train_dataset']
      train_labels = save['train_labels']
      valid_dataset = save['valid_dataset']
      valid_labels = save['valid_labels']
      test_dataset = save['test_dataset']
      test_labels = save['test_labels']
      del save  # hint to help gc free up memory
      print('Training set', train_dataset.shape, train_labels.shape)
      print('Validation set', valid_dataset.shape, valid_labels.shape)
      print('Test set', test_dataset.shape, test_labels.shape)

    #prepare data to have right format for tensorflow
    #i.e. data is flat matrix, labels are onehot
    image_size = 28
    num_labels = 10

    def reformat(dataset, labels):
      dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
      # Map 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
      labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
      return dataset, labels

    train_dataset, train_labels = reformat(train_dataset, train_labels)
    valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
    test_dataset, test_labels = reformat(test_dataset, test_labels)
    print('Training set', train_dataset.shape, train_labels.shape)
    print('Validation set', valid_dataset.shape, valid_labels.shape)
    print('Test set', test_dataset.shape, test_labels.shape)

    #now is the interesting part - we are building a network with
    #one hidden ReLU layer and our usual output linear layer

    #we are going to use SGD so here is our size of batch
    batch_size = 128

    #building tensorflow graph
    graph = tf.Graph()
    with graph.as_default():
      # Input data. For the training data, we use a placeholder that will be fed
      # at run time with a training minibatch.
      tf_train_dataset = tf.placeholder(tf.float32,
                                        shape=(batch_size, image_size * image_size))
      tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
      tf_valid_dataset = tf.constant(valid_dataset)
      tf_test_dataset = tf.constant(test_dataset)

      #now let's build our new hidden layer
      #that's how many hidden neurons we want
      num_hidden_neurons = 1024
      #its weights
      hidden_weights = tf.Variable(
        tf.truncated_normal([image_size * image_size, num_hidden_neurons]))
      hidden_biases = tf.Variable(tf.zeros([num_hidden_neurons]))

      #now the layer itself. It multiplies data by weights, adds biases
      #and takes ReLU over result
      hidden_layer = tf.nn.relu(tf.matmul(tf_train_dataset, hidden_weights) + hidden_biases)

      #time to go for output linear layer
      #out weights connect hidden neurons to output labels
      #biases are added to output labels
      out_weights = tf.Variable(
        tf.truncated_normal([num_hidden_neurons, num_labels]))
      out_biases = tf.Variable(tf.zeros([num_labels]))

      #compute output
      out_layer = tf.matmul(hidden_layer, out_weights) + out_biases
      #our real output is a softmax of prior result
      #and we also compute its cross-entropy to get our loss
      loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(out_layer, tf_train_labels))

      #now we just minimize this loss to actually train the network
      optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

      #nice, now let's calculate the predictions on each dataset for evaluating the
      #performance so far
      # Predictions for the training, validation, and test data.
      train_prediction = tf.nn.softmax(out_layer)
      valid_relu = tf.nn.relu(tf.matmul(tf_valid_dataset, hidden_weights) + hidden_biases)
      valid_prediction = tf.nn.softmax(tf.matmul(valid_relu, out_weights) + out_biases)
      test_relu = tf.nn.relu(tf.matmul(tf_test_dataset, hidden_weights) + hidden_biases)
      test_prediction = tf.nn.softmax(tf.matmul(test_relu, out_weights) + out_biases)

Compute a regularization loss on a tensor by directly calling a regularizer as if it is a one-argument function. E.g. >>> regularizer = tf.keras...
With L2 regularization, the weight update is the same as the usual gradient descent learning rule, except that we first rescale the weights w by a factor of (1 − ηλ/n), where η is the learning rate, λ the regularization strength, and n the size of the training set. This rescaling is why L2 regularization is often referred to as weight decay: it makes the weights smaller.
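The rescaling described above can be checked numerically. The sketch below is illustrative only and uses plain Python rather than TensorFlow. It assumes the regularized cost has the form C = C0 + (λ / 2n) · Σ w², which is the form that yields the (1 − ηλ/n) factor. The function names and the example numbers are hypothetical.

```python
def sgd_step_with_l2(w, grad_c0, eta, lam, n):
    """One gradient step on C = C0 + (lam / (2 * n)) * sum(w_i ** 2)."""
    # The gradient of the penalty term with respect to w_i is (lam / n) * w_i.
    return [wi - eta * (gi + (lam / n) * wi) for wi, gi in zip(w, grad_c0)]


def weight_decay_step(w, grad_c0, eta, lam, n):
    """The same update written as: rescale the weights, then take the usual step."""
    decay = 1.0 - (eta * lam) / n
    return [decay * wi - eta * gi for wi, gi in zip(w, grad_c0)]


# Hypothetical weights and unregularized gradients, chosen only for illustration.
w = [0.5, -1.2, 2.0]
grad_c0 = [0.1, -0.3, 0.05]

step_a = sgd_step_with_l2(w, grad_c0, eta=0.5, lam=0.1, n=10)
step_b = weight_decay_step(w, grad_c0, eta=0.5, lam=0.1, n=10)
print(step_a)
print(step_b)
```

Both functions produce the same new weights up to floating-point rounding, which is the sense in which the L2 penalty acts as a decay on the weights.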
In L2 regularization we take the sum of the squares of all the parameters and add it to the squared difference between the actual outputs and the predictions. As with L1, increasing the value of lambda decreases the values of the parameters, because the penalty on them grows.
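To make the effect of lambda concrete, here is a minimal sketch in plain Python. It assumes a one-parameter linear model without an intercept and a squared-error loss plus the penalty lam * w ** 2, for which the minimising slope has a closed form. The function name `ridge_slope` and the data are made up for illustration.

```python
def ridge_slope(xs, ys, lam):
    """Slope w that minimises sum((y - w * x) ** 2) + lam * w ** 2."""
    # Setting the derivative to zero gives w = sum(x * y) / (sum(x * x) + lam).
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Hypothetical data lying roughly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

lambdas = (0.0, 1.0, 10.0, 100.0)
slopes = [ridge_slope(xs, ys, lam) for lam in lambdas]
for lam, w in zip(lambdas, slopes):
    print(lam, w)
```

The fitted slope shrinks toward zero as lambda grows, which is the behaviour the paragraph above describes.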
In Keras, we add regularization to a layer through its kernel_regularizer parameter. To add L2 regularization, we pass keras.regularizers.l2().
A shorter and more scalable way of doing this would be:

    vars = tf.trainable_variables()
    lossL2 = tf.add_n([tf.nn.l2_loss(v) for v in vars]) * 0.001

This sums the l2_loss of all your trainable variables. You could also build a dictionary containing only the variables you want to add to your cost and apply the second line to it. You can then add lossL2 to your softmax cross-entropy value to get your total loss.
Edit: As mentioned by Piotr Dabkowski, the code above will also regularise biases. This can be avoided by adding an if clause to the second line:

    lossL2 = tf.add_n([tf.nn.l2_loss(v) for v in vars
                       if 'bias' not in v.name]) * 0.001

The same kind of filter can be used to exclude other variables.
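As a rough illustration of what the filtered expression computes, the sketch below mimics it in plain Python, without TensorFlow. It relies on tf.nn.l2_loss(v) being defined as sum(v ** 2) / 2. The variable names, their values, and the cross-entropy number are hypothetical stand-ins, not values from the question's network.

```python
def l2_loss(values):
    """Mimics tf.nn.l2_loss for a flat list of numbers: sum(v ** 2) / 2."""
    return sum(v * v for v in values) / 2.0


# Hypothetical stand-ins for the trainable variables, keyed by variable name.
variables = {
    "hidden_weights:0": [0.5, -1.0, 2.0],
    "hidden_biases:0": [0.1, 0.1],
    "out_weights:0": [1.5, -0.5],
    "out_biases:0": [0.2],
}

beta = 0.001
cross_entropy = 0.7  # placeholder for the mean softmax cross-entropy

# Keep only variables whose name does not contain 'bias', as in the answer.
penalised = {name: v for name, v in variables.items() if "bias" not in name}
loss_l2 = beta * sum(l2_loss(v) for v in penalised.values())
total_loss = cross_entropy + loss_l2
print(sorted(penalised), loss_l2, total_loss)
```

Both weight matrices (hidden and output) contribute to the penalty, while the biases are skipped. One caveat: the name filter only works if the bias variables were created with a name containing 'bias'. The variables in the question are created without a name argument, so TensorFlow assigns them default names such as Variable_1:0 and the filter would not exclude them. Passing something like name='hidden_biases' to tf.Variable, or summing tf.nn.l2_loss(hidden_weights) and tf.nn.l2_loss(out_weights) explicitly, avoids this.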