I'm having a hard time conceptualizing the difference between stateful and stateless LSTMs in Keras. My understanding is that at the end of each batch, the "state of the network is reset" in the stateless case, whereas for the stateful case, the state of the network is preserved for each batch, and must then be manually reset at the end of each epoch.
My questions are as follows: 1. In the stateless case, how is the network learning if the state isn't preserved in-between batches? 2. When would one use the stateless vs stateful modes of an LSTM?
Stateless works best when the the sequences you're learning aren't dependent on one another. Sentence-level prediction of a next word might be a good example of when to use stateless. The stateful configuration resets LSTM cell memory every epoch.
Stateful LSTM is used when the whole sequence plays a part in forming the output.
Setting an RNN to be stateful means that it can build a state across its training sequence and even maintain that state when doing predictions. The benefits of using stateful RNNs are smaller network sizes and/or lower training times.
All the RNN or LSTM models are stateful in theory. These models are meant to remember the entire sequence for prediction or classification tasks. However, in practice, you need to create a batch to train a model with backprogation algorithm, and the gradient can't backpropagate between batches.
I recommend you to firstly learn the concepts of BPTT (Back Propagation Through Time) and mini-batch SGD(Stochastic Gradient Descent), then you'll have further understandings of LSTM's training procedure.
For your questions,
Q1. In stateless cases, LSTM updates parameters on batch1 and then, initiate hidden states and cell states (usually all zeros) for batch2, while in stateful cases, it uses batch1's last output hidden states and cell sates as initial states for batch2.
Q2. As you can see above, when two sequences in two batches have connections (e.g. prices of one stock), you'd better use stateful mode, else (e.g. one sequence represents a complete sentence) you should use stateless mode.
BTW, @vu.pham said if we use stateful RNN, then in production, the network is forced to deal with infinite long sequences
. This seems not correct, actually, as you can see in Q1, LSTM WON'T learn on the whole sequence, it first learns sequence in batch1, updates parameters, and then learn sequence on batch2.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With