
TensorFlow - introducing both L2 regularization and dropout into the network. Does it make any sense?


I am currently playing with an ANN as part of the Udacity Deep Learning course.

I successfully built and trained the network and introduced L2 regularization on all weights and biases. Right now I am trying out dropout for the hidden layer in order to improve generalization. I wonder, does it make sense to introduce both L2 regularization and dropout on the same hidden layer? If so, how do I do this properly?

During dropout we literally switch off half of the activations of the hidden layer and double the output of the remaining neurons. When using L2 we compute the L2 norm over all hidden weights. But I am not sure how to compute L2 when we use dropout. We switch off some activations, so shouldn't we remove the weights which are 'not used' now from the L2 calculation? Any references on the matter would be useful; I haven't found any info.
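To make the mechanics in the paragraph above concrete, here is a minimal sketch in plain Python rather than TensorFlow. It assumes the usual "inverted dropout" formulation (survivors are scaled by 1 / keep_prob) and the tf.nn.l2_loss convention of sum(w ** 2) / 2. The helper names `dropout` and `l2_penalty` and all the numbers are hypothetical, chosen only for illustration.

```python
import random

def dropout(activations, keep_prob, rng):
    """Inverted dropout: zero each activation with probability 1 - keep_prob
    and scale the survivors by 1 / keep_prob."""
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

def l2_penalty(weights):
    """Same convention as tf.nn.l2_loss: sum(w ** 2) / 2 over every weight."""
    return sum(w * w for row in weights for w in row) / 2.0

rng = random.Random(0)
hidden = [1.0, 2.0, 3.0, 4.0]  # hypothetical hidden activations
out_weights = [[0.5, -1.0], [2.0, 0.0], [1.0, 1.0], [-0.5, 3.0]]  # hypothetical

dropped = dropout(hidden, keep_prob=0.5, rng=rng)
# Every entry is either switched off or doubled (1 / 0.5 == 2).
print(dropped)
# The penalty takes only the weights as input, so the mask cannot change it.
print(l2_penalty(out_weights))
```

In this sketch the penalty is a function of the weight values alone, so it is the same whichever activations happen to be dropped on a given step. As far as I know this is also how the two are usually combined in practice: the mask changes the activations, while the L2 term keeps covering all weights.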

Just in case you are interested, my code for the ANN with L2 regularization is below:

import tensorflow as tf

#for NeuralNetwork model code is below
#We will use SGD for training to save our time. Code is from Assignment 2
#beta is the new parameter - controls level of regularization. Default is 0.01
#but feel free to play with it
#notice, we introduce L2 for both biases and weights of all layers

beta = 0.01

#building tensorflow graph
graph = tf.Graph()
with graph.as_default():
  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,
                                    shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)

  #now let's build our new hidden layer
  #that's how many hidden neurons we want
  num_hidden_neurons = 1024
  #its weights
  hidden_weights = tf.Variable(
    tf.truncated_normal([image_size * image_size, num_hidden_neurons]))
  hidden_biases = tf.Variable(tf.zeros([num_hidden_neurons]))

  #now the layer itself. It multiplies data by weights, adds biases
  #and takes ReLU over result
  hidden_layer = tf.nn.relu(tf.matmul(tf_train_dataset, hidden_weights) + hidden_biases)

  #time to go for output linear layer
  #out weights connect hidden neurons to output labels
  #biases are added to output labels
  out_weights = tf.Variable(
    tf.truncated_normal([num_hidden_neurons, num_labels]))

  out_biases = tf.Variable(tf.zeros([num_labels]))

  #compute output
  out_layer = tf.matmul(hidden_layer, out_weights) + out_biases
  #our real output is a softmax of prior result
  #and we also compute its cross-entropy to get our loss
  #Notice - we introduce our L2 here
  #(keyword arguments: TensorFlow 1.x rejects positional ones for this call)
  loss = (tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=out_layer, labels=tf_train_labels) +
    beta*tf.nn.l2_loss(hidden_weights) +
    beta*tf.nn.l2_loss(hidden_biases) +
    beta*tf.nn.l2_loss(out_weights) +
    beta*tf.nn.l2_loss(out_biases)))

  #now we just minimize this loss to actually train the network
  optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

  #nice, now let's calculate the predictions on each dataset for evaluating the
  #performance so far
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(out_layer)
  valid_relu = tf.nn.relu(tf.matmul(tf_valid_dataset, hidden_weights) + hidden_biases)
  valid_prediction = tf.nn.softmax(tf.matmul(valid_relu, out_weights) + out_biases)

  test_relu = tf.nn.relu(tf.matmul(tf_test_dataset, hidden_weights) + hidden_biases)
  test_prediction = tf.nn.softmax(tf.matmul(test_relu, out_weights) + out_biases)

#now is the actual training on the ANN we built
#we will run it for some number of steps and evaluate the progress after
#every 500 steps

#number of steps we will train our ANN
num_steps = 3001

#actual training
with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print("Initialized")
  for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 500 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
      print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
Maksim Khaitovich asked Jul 10 '16 14:07


2 Answers

OK, after some additional effort I managed to solve it and introduce both L2 and dropout into my network; the code is below. I got a slight improvement over the same network without dropout (with L2 in place). I am still not sure if it is really worth the effort to introduce both L2 and dropout, but at least it works and slightly improves the results.

import tensorflow as tf

#ANN with introduced dropout
#This time we still use the L2 but restrict training dataset
#to be extremely small

#get just first 500 of examples, so that our ANN can memorize whole dataset
train_dataset_2 = train_dataset[:500, :]
train_labels_2 = train_labels[:500]

#batch size for SGD and beta parameter for L2 loss
batch_size = 128
beta = 0.001

#that's how many hidden neurons we want
num_hidden_neurons = 1024

#building tensorflow graph
graph = tf.Graph()
with graph.as_default():
  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,
                                    shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)

  #now let's build our new hidden layer
  #its weights
  hidden_weights = tf.Variable(
    tf.truncated_normal([image_size * image_size, num_hidden_neurons]))
  hidden_biases = tf.Variable(tf.zeros([num_hidden_neurons]))

  #now the layer itself. It multiplies data by weights, adds biases
  #and takes ReLU over result
  hidden_layer = tf.nn.relu(tf.matmul(tf_train_dataset, hidden_weights) + hidden_biases)

  #add dropout on hidden layer
  #keep_prob is the probability of keeping (not switching off) an activation
  #tf.nn.dropout switches off the rest and rescales the kept activations
  keep_prob = tf.placeholder("float")
  hidden_layer_drop = tf.nn.dropout(hidden_layer, keep_prob)

  #time to go for output linear layer
  #out weights connect hidden neurons to output labels
  #biases are added to output labels
  out_weights = tf.Variable(
    tf.truncated_normal([num_hidden_neurons, num_labels]))

  out_biases = tf.Variable(tf.zeros([num_labels]))

  #compute output
  #notice that upon training we use the switched off activations
  #i.e. the variation of hidden_layer with the dropout active
  out_layer = tf.matmul(hidden_layer_drop, out_weights) + out_biases
  #our real output is a softmax of prior result
  #and we also compute its cross-entropy to get our loss
  #Notice - we introduce our L2 here
  #(keyword arguments: TensorFlow 1.x rejects positional ones for this call)
  loss = (tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=out_layer, labels=tf_train_labels) +
    beta*tf.nn.l2_loss(hidden_weights) +
    beta*tf.nn.l2_loss(hidden_biases) +
    beta*tf.nn.l2_loss(out_weights) +
    beta*tf.nn.l2_loss(out_biases)))

  #now we just minimize this loss to actually train the network
  optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

  #nice, now let's calculate the predictions on each dataset for evaluating the
  #performance so far
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(out_layer)
  valid_relu = tf.nn.relu(tf.matmul(tf_valid_dataset, hidden_weights) + hidden_biases)
  valid_prediction = tf.nn.softmax(tf.matmul(valid_relu, out_weights) + out_biases)

  test_relu = tf.nn.relu(tf.matmul(tf_test_dataset, hidden_weights) + hidden_biases)
  test_prediction = tf.nn.softmax(tf.matmul(test_relu, out_weights) + out_biases)

#now is the actual training on the ANN we built
#we will run it for some number of steps and evaluate the progress after
#every 500 steps

#number of steps we will train our ANN
num_steps = 3001

#actual training
with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print("Initialized")
  for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (train_labels_2.shape[0] - batch_size)
    # Generate a minibatch.
    batch_data = train_dataset_2[offset:(offset + batch_size), :]
    batch_labels = train_labels_2[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob : 0.5}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 500 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
      print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
Maksim Khaitovich answered Oct 14 '22 23:10

There is no downside to using multiple forms of regularization. In fact, there is a paper, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, in which the authors checked how much it helps. Clearly you will get different results on different datasets, but for your MNIST:

[Image: comparison of regularization methods on MNIST, from the paper]

you can see that Dropout + Max-norm gives the lowest error. Apart from this, you have a big error in your code.

You use l2_loss on weights and biases:

beta*tf.nn.l2_loss(hidden_weights) +
beta*tf.nn.l2_loss(hidden_biases) +
beta*tf.nn.l2_loss(out_weights) +
beta*tf.nn.l2_loss(out_biases)))

You should not penalize large biases, so remove l2_loss over the biases.
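As a rough illustration of that advice, here is a minimal sketch in plain Python (not TensorFlow) of a loss that penalizes weights only. The helper names `l2_loss` and `regularized_loss`, the flattened parameter lists and the data loss of 0.7 are all hypothetical; only the sum(v ** 2) / 2 convention is taken from tf.nn.l2_loss.

```python
def l2_loss(values):
    """sum(v ** 2) / 2, the convention used by tf.nn.l2_loss."""
    return sum(v * v for v in values) / 2.0

def regularized_loss(data_loss, weight_tensors, beta):
    """Add the L2 penalty for weights only; biases are deliberately left out."""
    return data_loss + beta * sum(l2_loss(w) for w in weight_tensors)

# Hypothetical flattened parameters, for illustration only.
hidden_weights = [0.5, -1.5, 2.0]
out_weights = [1.0, -1.0]
hidden_biases = [10.0, -10.0]  # large, but not penalized

total = regularized_loss(data_loss=0.7,
                         weight_tensors=[hidden_weights, out_weights],
                         beta=0.01)
print(total)
```

In the question's graph this would amount to keeping the beta*tf.nn.l2_loss(hidden_weights) and beta*tf.nn.l2_loss(out_weights) terms and dropping the two bias terms. The usual argument, as I understand it, is that biases do not multiply the inputs, so shrinking them does little against overfitting.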

Salvador Dali answered Oct 14 '22 21:10