TensorFlow - introducing both L2 regularization and dropout into the network. Does it make any sense?

I am currently playing with an ANN as part of the Udacity Deep Learning course.

I successfully built and trained the network and introduced L2 regularization on all weights and biases. Right now I am trying out dropout on the hidden layer in order to improve generalization. I wonder, does it make sense to apply both L2 regularization and dropout to the same hidden layer? If so, how do I do this properly?

During dropout we literally switch off half of the activations of the hidden layer and double the outputs of the remaining neurons. With L2 we compute the L2 norm over all hidden weights. But I am not sure how to compute L2 when dropout is used. Since we switch off some activations, shouldn't we remove the weights that are now 'not used' from the L2 calculation? Any references on the matter would be useful; I haven't found any information.
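To make the mechanism described above concrete, here is a minimal plain-Python sketch of inverted dropout with a keep probability of 0.5. The helper name `inverted_dropout` and the sample values are made up for illustration. `tf.nn.dropout` applies the same idea: kept activations are scaled by 1 / keep_prob and dropped ones become zero.

```python
import random


def inverted_dropout(activations, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob and scale the
    survivors by 1 / keep_prob, so the expected value stays the same."""
    out = []
    for a in activations:
        if rng.random() < keep_prob:
            out.append(a / keep_prob)  # kept: doubled when keep_prob = 0.5
        else:
            out.append(0.0)            # dropped: switched off
    return out


rng = random.Random(0)                 # fixed seed, only for repeatability
hidden = [1.0, 2.0, 3.0, 4.0]          # hypothetical hidden-layer activations
dropped = inverted_dropout(hidden, keep_prob=0.5, rng=rng)
print(dropped)  # every entry is either 0.0 or twice the original value
```

Note that only activations are masked here; the weights themselves are never touched by this step.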

Just in case you are interested, my code for ANN with L2 regularization is below:

#for NeuralNetwork model code is below
#We will use SGD for training to save our time. Code is from Assignment 2
#beta is the new parameter - controls level of regularization. Default is 0.01
#but feel free to play with it
#notice, we introduce L2 for both biases and weights of all layers

beta = 0.01

#building tensorflow graph
graph = tf.Graph()
with graph.as_default():
  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,
                                    shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)

  #now let's build our new hidden layer
  #that's how many hidden neurons we want
  num_hidden_neurons = 1024
  #its weights
  hidden_weights = tf.Variable(
    tf.truncated_normal([image_size * image_size, num_hidden_neurons]))
  hidden_biases = tf.Variable(tf.zeros([num_hidden_neurons]))

  #now the layer itself. It multiplies data by weights, adds biases
  #and takes ReLU over result
  hidden_layer = tf.nn.relu(tf.matmul(tf_train_dataset, hidden_weights) + hidden_biases)

  #time to go for output linear layer
  #out weights connect hidden neurons to output labels
  #biases are added to output labels
  out_weights = tf.Variable(
    tf.truncated_normal([num_hidden_neurons, num_labels]))

  out_biases = tf.Variable(tf.zeros([num_labels]))

  #compute output
  out_layer = tf.matmul(hidden_layer,out_weights) + out_biases
  #our real output is a softmax of prior result
  #and we also compute its cross-entropy to get our loss
  #Notice - we introduce our L2 here
  loss = (tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    out_layer, tf_train_labels) +
    beta*tf.nn.l2_loss(hidden_weights) +
    beta*tf.nn.l2_loss(hidden_biases) +
    beta*tf.nn.l2_loss(out_weights) +
    beta*tf.nn.l2_loss(out_biases)))

  #now we just minimize this loss to actually train the network
  optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

  #nice, now let's calculate the predictions on each dataset for evaluating the
  #performance so far
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(out_layer)
  valid_relu = tf.nn.relu(tf.matmul(tf_valid_dataset, hidden_weights) + hidden_biases)
  valid_prediction = tf.nn.softmax(tf.matmul(valid_relu, out_weights) + out_biases)

  test_relu = tf.nn.relu(tf.matmul(tf_test_dataset, hidden_weights) + hidden_biases)
  test_prediction = tf.nn.softmax(tf.matmul(test_relu, out_weights) + out_biases)

#now is the actual training on the ANN we built
#we will run it for some number of steps and evaluate the progress after
#every 500 steps

#number of steps we will train our ANN
num_steps = 3001

#actual training
with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print("Initialized")
  for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 500 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
      print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

asked Jul 10 '16 14:07

Maksim Khaitovich

2 Answers

OK, after some additional effort I managed to solve it and introduce both L2 and dropout into my network; the code is below. I got a slight improvement over the same network without dropout (with L2 in place). I am still not sure whether it is really worth the effort to introduce both L2 and dropout, but at least it works and slightly improves the results.

#ANN with introduced dropout
#This time we still use the L2 but restrict training dataset
#to be extremely small

#get just first 500 of examples, so that our ANN can memorize whole dataset
train_dataset_2 = train_dataset[:500, :]
train_labels_2 = train_labels[:500]

#batch size for SGD and beta parameter for L2 loss
batch_size = 128
beta = 0.001

#that's how many hidden neurons we want
num_hidden_neurons = 1024

#building tensorflow graph
graph = tf.Graph()
with graph.as_default():
  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,
                                    shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)

  #now let's build our new hidden layer
  #its weights
  hidden_weights = tf.Variable(
    tf.truncated_normal([image_size * image_size, num_hidden_neurons]))
  hidden_biases = tf.Variable(tf.zeros([num_hidden_neurons]))

  #now the layer itself. It multiplies data by weights, adds biases
  #and takes ReLU over result
  hidden_layer = tf.nn.relu(tf.matmul(tf_train_dataset, hidden_weights) + hidden_biases)

  #add dropout on hidden layer
  #keep_prob is the probability of keeping each activation
  #(fed at run time); the rest are switched off
  keep_prob = tf.placeholder("float")
  hidden_layer_drop = tf.nn.dropout(hidden_layer, keep_prob)

  #time to go for output linear layer
  #out weights connect hidden neurons to output labels
  #biases are added to output labels
  out_weights = tf.Variable(
    tf.truncated_normal([num_hidden_neurons, num_labels]))

  out_biases = tf.Variable(tf.zeros([num_labels]))

  #compute output
  #notice that upon training we use the switched off activations
  #i.e. the variation of hidden_layer with the dropout active
  out_layer = tf.matmul(hidden_layer_drop,out_weights) + out_biases
  #our real output is a softmax of prior result
  #and we also compute its cross-entropy to get our loss
  #Notice - we introduce our L2 here
  loss = (tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    out_layer, tf_train_labels) +
    beta*tf.nn.l2_loss(hidden_weights) +
    beta*tf.nn.l2_loss(hidden_biases) +
    beta*tf.nn.l2_loss(out_weights) +
    beta*tf.nn.l2_loss(out_biases)))

  #now we just minimize this loss to actually train the network
  optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

  #nice, now let's calculate the predictions on each dataset for evaluating the
  #performance so far
  # Predictions for the training, validation, and test data.
  #validation and test use hidden_weights directly, i.e. without dropout
  train_prediction = tf.nn.softmax(out_layer)
  valid_relu = tf.nn.relu(tf.matmul(tf_valid_dataset, hidden_weights) + hidden_biases)
  valid_prediction = tf.nn.softmax(tf.matmul(valid_relu, out_weights) + out_biases)

  test_relu = tf.nn.relu(tf.matmul(tf_test_dataset, hidden_weights) + hidden_biases)
  test_prediction = tf.nn.softmax(tf.matmul(test_relu, out_weights) + out_biases)

#now is the actual training on the ANN we built
#we will run it for some number of steps and evaluate the progress after
#every 500 steps

#number of steps we will train our ANN
num_steps = 3001

#actual training
with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print("Initialized")
  for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (train_labels_2.shape[0] - batch_size)
    # Generate a minibatch.
    batch_data = train_dataset_2[offset:(offset + batch_size), :]
    batch_labels = train_labels_2[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob : 0.5}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 500 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
      print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
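In the code above, the L2 terms are computed on the full weight variables, while dropout is applied only to the hidden activations. The rough plain-Python sketch below shows that structure: the penalty takes the weights as input and has no dependence on any dropout mask. The function names, the tiny weight matrices, and the data-loss value are hypothetical; the only TensorFlow fact relied on is that `tf.nn.l2_loss` returns half the sum of squares.

```python
def l2_loss(values):
    """Half the sum of squares, mirroring what tf.nn.l2_loss returns."""
    return sum(v * v for v in values) / 2.0


def total_loss(data_loss, weight_matrices, beta):
    """Data loss plus beta times the L2 term over every weight entry.

    The dropout mask is not an argument: the penalty looks at the
    weights only, not at which activations were kept in this step.
    """
    penalty = sum(l2_loss(row) for w in weight_matrices for row in w)
    return data_loss + beta * penalty


# Hypothetical tiny network: 2 inputs -> 2 hidden units -> 1 output.
hidden_w = [[0.5, -1.0], [2.0, 0.0]]
out_w = [[1.0], [-3.0]]

# 0.7 stands in for the cross-entropy computed with dropout active.
loss = total_loss(0.7, [hidden_w, out_w], beta=0.001)
print(loss)
```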

answered Oct 14 '22 23:10

Maksim Khaitovich

There is no downside to using multiple regularizers. In fact, the authors of the paper "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" checked how much combining them helps. Clearly, results will differ between datasets, but for MNIST:

[Image: table from the paper comparing regularization methods on MNIST]

You can see that Dropout + Max-norm gives the lowest error. Apart from this, you have a big error in your code.

You use l2_loss on weights and biases:

beta*tf.nn.l2_loss(hidden_weights) +
beta*tf.nn.l2_loss(hidden_biases) +
beta*tf.nn.l2_loss(out_weights) +
beta*tf.nn.l2_loss(out_biases)))

You should not penalize large biases, so remove the l2_loss terms for the biases.
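A minimal sketch of that change, written in plain Python rather than TensorFlow so it can be run on its own. The parameter names follow the question's code, but the flat value lists, the helper `weights_only_penalty`, and the convention of skipping names ending in `_biases` are assumptions made for illustration.

```python
def l2_loss(values):
    """Half the sum of squares, mirroring what tf.nn.l2_loss returns."""
    return sum(v * v for v in values) / 2.0


def weights_only_penalty(params, beta):
    """Sum beta * l2_loss over every parameter except the biases.

    Assumes (for this sketch only) that bias entries are the ones
    whose name ends in '_biases'.
    """
    return sum(beta * l2_loss(values)
               for name, values in params.items()
               if not name.endswith("_biases"))


# Hypothetical flattened parameters, named as in the question's code.
params = {
    "hidden_weights": [0.5, -1.0, 2.0],
    "hidden_biases": [10.0, 10.0],   # large, but not penalized
    "out_weights": [1.0, -3.0],
    "out_biases": [10.0],            # large, but not penalized
}

penalty = weights_only_penalty(params, beta=0.001)
print(penalty)
```

In the TensorFlow graph this corresponds to keeping only the `hidden_weights` and `out_weights` terms in the loss.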

answered Oct 14 '22 21:10

Salvador Dali
