
In what order should we tune hyperparameters in Neural Networks?

I have a fairly simple ANN built with TensorFlow and AdamOptimizer for a regression problem, and I am now at the point of tuning all the hyperparameters.

So far, I have found many different hyperparameters that I have to tune:

  • Learning rate: initial learning rate, learning rate decay
  • The AdamOptimizer needs 4 arguments (learning rate, beta1, beta2, epsilon), so we need to tune them, at least epsilon
  • Batch size
  • Number of iterations
  • Lambda, the L2-regularization parameter
  • Number of neurons, number of layers
  • Type of activation function for the hidden layers and for the output layer
  • Dropout parameter
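For reference, the list above can be collected into a single configuration dict so every tunable value lives in one place. All names and values below are illustrative placeholders, not recommendations:

```python
# Illustrative grouping of the hyperparameters listed above.
# Every name and default value here is a placeholder to be tuned.
hparams = {
    "initial_learning_rate": 1e-3,
    "learning_rate_decay": "1/sqrt(epoch)",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "batch_size": 128,
    "num_iterations": 10000,
    "l2_lambda": 1e-4,
    "num_layers": 3,
    "num_neurons_per_layer": 64,
    "hidden_activation": "relu",
    "output_activation": "linear",  # a common choice for regression
    "dropout_keep_prob": 0.9,
}
```

Keeping them together like this also makes it easy to log the full configuration of each trial alongside its results.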

I have 2 questions :

1) Do you see any other hyperparameter I might have forgotten?

2) For now, my tuning is quite "manual" and I am not sure I am doing everything properly. Is there a special order in which to tune the parameters, e.g. learning rate first, then batch size, then ...? I am not sure that all these parameters are independent; in fact, I am quite sure that some of them are not. Which ones are clearly independent and which ones are clearly not? Should we then tune the dependent ones together? Is there any paper or article that discusses tuning all the parameters in a specific order?

EDIT: Here are the graphs I got for different initial learning rates, batch sizes and regularization parameters. The purple curve is completely weird to me: its cost decreases much more slowly than the others, yet it gets stuck at a lower accuracy. Is it possible that the model is stuck in a local minimum?

[Accuracy plot]

[Cost plot]

For the learning rate, I used the decay schedule: LR(t) = LR_initial / sqrt(epoch)
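That decay schedule can be written as a small helper (assuming the numerator is the initial learning rate and that epoch is counted from 1, since epoch 0 would divide by zero):

```python
import math

def decayed_lr(initial_lr, epoch):
    """LR(t) = initial_lr / sqrt(epoch), with epoch counted from 1."""
    if epoch < 1:
        raise ValueError("epoch must be >= 1 for this schedule")
    return initial_lr / math.sqrt(epoch)

# With initial_lr = 0.01:
# epoch 1 -> 0.01, epoch 4 -> 0.005, epoch 100 -> 0.001
```

Note this schedule decays quickly at first (halved by epoch 4) and then flattens out, which interacts with Adam's own per-parameter adaptive scaling.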

Thanks for your help!

Paul Rolin asked May 26 '16


2 Answers

My general order is:

  1. Batch size, as it will largely affect the training time of future experiments.
  2. Architecture of the network:
    • Number of neurons in the network
    • Number of layers
  3. Rest (dropout, L2 reg, etc.)
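One way to read this ordering is as a sequence of coarse sweeps, where each stage fixes the winner of the previous one. A minimal sketch, with a stand-in evaluation function in place of a real training run:

```python
import itertools

def evaluate(config):
    # Stand-in for "train the model and return the validation loss".
    # Replace this with an actual training run.
    return abs(config["batch_size"] - 128) / 128 + 0.01 * config["num_layers"]

config = {"batch_size": 32, "num_layers": 1, "num_neurons": 32}

# Stage 1: batch size, since it dominates the wall-clock time of later stages.
config["batch_size"] = min(
    [32, 64, 128, 256],
    key=lambda b: evaluate({**config, "batch_size": b}))

# Stage 2: architecture, swept jointly since depth and width interact.
config["num_layers"], config["num_neurons"] = min(
    itertools.product([1, 2, 3], [32, 64, 128]),
    key=lambda lw: evaluate(
        {**config, "num_layers": lw[0], "num_neurons": lw[1]}))
```

The caveat with staged sweeps is exactly the dependency issue discussed below: a value fixed in an early stage may no longer be optimal after later stages change the architecture.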

Dependencies:

I'd assume that the optimal values of

  • learning rate and batch size
  • learning rate and number of neurons
  • number of neurons and number of layers

strongly depend on each other. I am not an expert in that field, though.
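Because of those dependencies, a random search that samples the coupled parameters jointly is often safer than tuning them one at a time. A sketch (ranges are illustrative; the learning rate is sampled log-uniformly because it varies over orders of magnitude):

```python
import math
import random

def sample_config(rng):
    # Sample all coupled hyperparameters together, so each trial is a
    # full configuration rather than a one-at-a-time sweep.
    log_lr = rng.uniform(math.log10(1e-5), math.log10(1e-1))
    return {
        "learning_rate": 10 ** log_lr,
        "batch_size": rng.choice([32, 64, 128, 256]),
        "num_layers": rng.randint(1, 4),
        "num_neurons": rng.choice([32, 64, 128]),
    }

rng = random.Random(0)
trials = [sample_config(rng) for _ in range(20)]
# Evaluate each trial and keep the best-performing configuration.
```

Random search explores the joint space without assuming independence, which is why it tends to beat grid search when some parameters matter much more than others.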

As for your hyperparameters:

  • For the Adam optimizer: "Recommended values in the paper are eps = 1e-8, beta1 = 0.9, beta2 = 0.999." (source)
  • For the learning rate with Adam and RMSProp, I found values around 0.001 to be optimal for most problems.
  • As an alternative to Adam, you can also use RMSProp, which reduces the memory footprint by up to 33%. See this answer for more details.
  • You could also tune the initial weight values (see All you need is a good init). That said, the Xavier initializer seems to be a good way to avoid having to tune the weight initialization.
  • I don't tune the number of iterations / epochs as a hyperparameter. I train the net until its validation error converges. However, I give each run a time budget.
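The "train until the validation error converges, under a time budget" rule from the last point can be sketched as an early-stopping loop. The validation errors here are a stand-in sequence; in practice each iteration would run one epoch of training:

```python
import time

def train(time_budget_s=3600, patience=10):
    """Stop when validation error hasn't improved for `patience` epochs,
    or when the wall-clock time budget runs out, whichever comes first."""
    best_error, epochs_without_improvement = float("inf"), 0
    start = time.monotonic()
    errors = iter([0.9, 0.5, 0.3, 0.3, 0.31] + [0.31] * 20)  # stand-in
    epoch = 0
    while time.monotonic() - start < time_budget_s:
        epoch += 1
        val_error = next(errors)  # replace with one epoch of real training
        if val_error < best_error:
            best_error, epochs_without_improvement = val_error, 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # validation error has converged
    return best_error, epoch
```

This removes the iteration count from the search space entirely: each trial decides its own stopping point, and the time budget keeps slow configurations from monopolizing the tuning run.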
Kilian Batzner answered Oct 07 '22


Get TensorBoard running and plot the error there. You'll need to create subdirectories in the path where TensorBoard looks for the data to plot; I do that subdirectory creation in the script. So I change a parameter in the script, give the trial a name there, run it, and plot all the trials in the same chart. You'll very soon get a feel for the most effective settings for your graph and data.
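The subdirectory-per-trial setup can be sketched like this (the naming scheme is an assumption; TensorBoard treats each subdirectory of its log directory as a separate run, so encoding the hyperparameters in the run name labels each curve in the legend):

```python
import os

def make_run_dir(logdir, **hparams):
    # Build a run name like "bs=128_lr=0.001" from the trial's
    # hyperparameters, and create the matching subdirectory.
    name = "_".join(f"{k}={v}" for k, v in sorted(hparams.items()))
    run_dir = os.path.join(logdir, name)
    os.makedirs(run_dir, exist_ok=True)
    return run_dir

# run_dir = make_run_dir("logs", lr=0.001, bs=128)
# Point a summary writer at run_dir, then launch:
#   tensorboard --logdir logs
```

With one subdirectory per trial, all runs appear as separate curves in the same chart, which is what makes the side-by-side comparison described above possible.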

Phillip Bock answered Oct 07 '22