Python Keras LSTM learning converges too fast on high loss

This is more of a deep learning conceptual problem, and if this is not the right platform I'll take it elsewhere.

I'm trying to use a Keras LSTM sequential model to learn sequences of text and map them to a numeric value (a regression problem).

The problem is that training always converges too quickly to a high loss (on both the training and test sets). I've tried many hyperparameter settings, and I suspect the model is getting stuck in a local minimum, which would explain its high bias.

My questions are basically:

How should I initialize the weights and biases for this problem?
Which optimizer should I use?
How deep should I make the network? (I'm afraid that with a very deep network the training time will be unbearable and the model variance will grow.)
Should I add more training data?

Input and output are normalized with min-max scaling.

I am using SGD with momentum, and currently have 3 LSTM layers (126, 256, 128) and 2 dense layers (200 neurons and 1 output neuron).

I printed the weights after a few epochs and noticed that many are zero and the rest are 1 or very close to it.

Here are some plots from TensorBoard: [image not included]

asked Sep 14 '17 16:09

NRG

2 Answers

Fast convergence to a very high loss could mean you are facing an exploding gradients problem. Try a much lower learning rate, such as 1e-5 or 1e-6. You can also try techniques like gradient clipping to limit the gradients when the learning rate is high.
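As a rough illustration of what clipping by norm does, here is a minimal pure-Python sketch. The function name `clip_by_norm`, the gradient values, and the threshold of 1.0 are assumptions made for this example, not something taken from the question. In Keras itself you would normally not write this by hand: the built-in optimizers accept `clipnorm` and `clipvalue` arguments alongside the learning rate.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale a gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

# A gradient that has "exploded" (hypothetical values, L2 norm = 500).
exploded = [300.0, -400.0]
# After clipping, the direction is kept but the norm is reduced to 1.
clipped = clip_by_norm(exploded, 1.0)
print(clipped)
```

Clipping only rescales the update; it does not replace choosing a sensible learning rate, so the two are usually tuned together.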

Answer 1

Another reason could be the initialization of the weights. Try the following three methods:

The method described in this paper (He initialization): https://arxiv.org/abs/1502.01852
Xavier initialization
Random initialization

In many cases the first initialization method works best.
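To make the difference between the first two schemes concrete, the sketch below computes the scale each one uses for a layer with `fan_in` inputs and `fan_out` outputs. The layer sizes and helper names are assumptions chosen for illustration. In Keras these schemes are available by name through a layer's `kernel_initializer` argument, for example `"he_normal"`, `"glorot_uniform"` (Xavier), and `"random_normal"`.

```python
import math
import random

def he_normal_std(fan_in):
    # He et al. (arXiv:1502.01852): standard deviation sqrt(2 / fan_in)
    return math.sqrt(2.0 / fan_in)

def glorot_uniform_limit(fan_in, fan_out):
    # Xavier/Glorot uniform: weights drawn from U(-limit, limit)
    return math.sqrt(6.0 / (fan_in + fan_out))

fan_in, fan_out = 128, 200  # assumed layer sizes
rng = random.Random(0)      # fixed seed so the sketch is repeatable

# Draw one row of weights under each scheme.
he_weights = [rng.gauss(0.0, he_normal_std(fan_in)) for _ in range(fan_in)]
limit = glorot_uniform_limit(fan_in, fan_out)
xavier_weights = [rng.uniform(-limit, limit) for _ in range(fan_in)]

print(he_normal_std(fan_in), limit)
```

Both schemes shrink the initial weights as the layer gets wider, which is what keeps activations and gradients at a reasonable scale at the start of training.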

Answer 2

You can try different optimizers, such as:

Momentum optimizer
SGD or Gradient descent
Adam optimizer

The choice of optimizer should take your loss function into account. For example, for a logistic regression problem with MSE as the loss function, the loss surface is non-convex, so gradient-based optimizers are not guaranteed to converge to the global minimum.
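For intuition about how the first two options differ, here is a minimal sketch of the plain gradient descent and momentum update rules on a toy one-dimensional loss. The loss `(w - 3)**2`, the learning rate, and the momentum coefficient are all assumptions for illustration. In Keras the corresponding classes are `keras.optimizers.SGD` (with its `momentum` argument) and `keras.optimizers.Adam`.

```python
def grad(w):
    # Gradient of the toy loss (w - 3)**2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)

def run_sgd(steps, lr=0.1):
    # Plain gradient descent: step directly against the gradient.
    w = 0.0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def run_momentum(steps, lr=0.1, mu=0.9):
    # Momentum: accumulate a velocity that smooths successive gradients.
    w, velocity = 0.0, 0.0
    for _ in range(steps):
        velocity = mu * velocity - lr * grad(w)
        w += velocity
    return w

w_sgd = run_sgd(300)
w_momentum = run_momentum(300)
print(w_sgd, w_momentum)  # both end up close to 3
```

On a loss this simple both rules reach the minimum; the differences between optimizers matter more on the noisy, poorly conditioned losses seen in real training.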

Answer 3

How deep or wide your network should be again depends on which type of network you are using and what the problem is.

As you said, you are using a sequential LSTM model to learn sequences of text. Your choice of model is a good fit for this problem; you can also try 4-5 LSTM layers.

Answer 4

If your gradients are going to 0, that is the vanishing gradients problem; if they are going to infinity, that is the exploding gradients problem. Either can show up as early convergence. Try gradient clipping with a suitable learning rate, together with the first weight initialization technique.
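To see why gradients can shrink toward zero or blow up in recurrent networks, consider a deliberately simplified picture in which backpropagation through time multiplies the gradient by the same factor at every step. The factors 0.5 and 1.5 and the sequence length of 50 are assumptions; a real LSTM's gates make the behaviour more complicated than this.

```python
def gradient_after(steps, factor, start=1.0):
    """Multiply a gradient by `factor` once per time step."""
    g = start
    for _ in range(steps):
        g *= factor
    return g

vanishing = gradient_after(50, 0.5)  # factor below 1: shrinks toward 0
exploding = gradient_after(50, 1.5)  # factor above 1: grows very large
print(vanishing, exploding)
```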

I am sure this will definitely solve your problem.

answered Oct 12 '22 19:10

Avinash Rai

Consider reducing your batch_size. With a large batch_size, the gradient estimate contains less of the stochastic variation present in your data, which may be why training converges early.
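The link between batch size and gradient noise can be sketched with a small simulation. Here each per-example gradient is drawn from a Gaussian with an assumed mean of 1.0 and standard deviation of 2.0, and the batch sizes 4 and 64 are also assumptions; the point is only that averaging over a larger batch yields a less noisy gradient estimate.

```python
import random
import statistics

def batch_gradient_spread(batch_size, trials=500, seed=0):
    """Standard deviation of the mini-batch gradient across many batches."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        # Simulated per-example gradients (assumed Gaussian).
        batch = [rng.gauss(1.0, 2.0) for _ in range(batch_size)]
        estimates.append(sum(batch) / batch_size)
    return statistics.pstdev(estimates)

small_batch_noise = batch_gradient_spread(4)
large_batch_noise = batch_gradient_spread(64)
print(small_batch_noise, large_batch_noise)
```

The extra noise from small batches can help the optimizer move away from flat or poor regions, at the cost of slower and less stable epochs.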

answered Oct 12 '22 18:10

Aziz

Tags: python, tensorflow, deep-learning, keras, lstm