I am trying to tag letters in long character sequences. The inherent structure of the data requires me to use a bidirectional approach. Furthermore, for this idea I need access to the hidden state at each timestep, not just the final one.
To try the idea out I used a fixed-length approach: I currently take batches of random pieces of, say, 60 characters each out of my much longer sequences and run my handmade bidirectional classifier with zero_state as the initial_state for each 60-character piece.
This worked fine, but obviously not perfectly, since in reality the sequences are longer and the information to the left and right of the piece I randomly cut from the original source is lost.
Now, in order to make progress, I want to work with the entire sequences. They vary heavily in length, though, and there is no way I will fit the entire sequences (batched, no less) onto the GPU.
I found the swap_memory parameter in the dynamic_rnn documentation. Would that help?
I didn't find any further documentation that helped me understand it, and I cannot easily try it out myself: since I need access to the hidden states at each timestep, I built the current graph without any of the higher-level wrappers (such as dynamic_rnn). Trying it out would require me to get all the intermediate states out of the wrapper, which, as I understand it, is a lot of work to implement.
Before going through the hassle of trying this out, I would love to be sure that it would indeed solve my memory issue. Thanks for any hints!
TL;DR: swap_memory won't let you work with pseudo-infinite sequences, but it will help you fit bigger (longer, wider, or larger-batch) sequences in memory. There is a separate trick for pseudo-infinite sequences, but it only applies to unidirectional RNNs.
swap_memory
During training, a NN (including RNN) generally needs to save some activations in memory -- they are needed to calculate the gradient.
What swap_memory does is tell your RNN to store those activations in host (CPU) memory instead of device (GPU) memory, and stream them back to the GPU as they are needed.
Effectively, this lets you pretend that your GPU has more memory than it actually does (at the expense of CPU memory, which tends to be more plentiful).
You still have to pay the computational cost of using very long sequences. Not to mention that you might run out of host memory.
To use it, simply set that argument to True.
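For illustration, a minimal sketch with tf.nn.dynamic_rnn (TF 1.x); the sizes and the LSTM cell here are made-up placeholders, not anything specific to your setup:

    import tensorflow as tf  # TensorFlow 1.x API

    # Hypothetical sizes -- adjust to your data.
    batch_size, max_time, num_features, num_units = 32, 5000, 64, 128

    inputs = tf.placeholder(tf.float32, [batch_size, max_time, num_features])
    cell = tf.nn.rnn_cell.LSTMCell(num_units)

    # swap_memory=True streams the activations needed for backprop out to
    # host (CPU) memory and back, trading GPU memory for transfer time.
    outputs, final_state = tf.nn.dynamic_rnn(
        cell, inputs, dtype=tf.float32, swap_memory=True)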
sequence_length
Use this parameter if your sequences are of different lengths. sequence_length has a misleading name: it's actually an array of sequence lengths.
You still need as much memory as you would have needed if all your sequences were of the same length (the max_time parameter).
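As a sketch (again TF 1.x, with made-up sizes): pad every sequence in the batch to the same max_time and pass the true lengths alongside.

    import tensorflow as tf  # TensorFlow 1.x API

    # Made-up sizes; seq_lens holds the true (unpadded) length of each sequence.
    batch_size, max_time, num_features, num_units = 4, 100, 64, 128

    inputs = tf.placeholder(tf.float32, [batch_size, max_time, num_features])
    seq_lens = tf.placeholder(tf.int32, [batch_size])  # e.g. [100, 73, 51, 98]

    cell = tf.nn.rnn_cell.LSTMCell(num_units)

    # Steps beyond a sequence's true length produce zero outputs and copy the
    # state through, but the tensors are still max_time steps long, so memory
    # usage is still governed by max_time.
    outputs, final_state = tf.nn.dynamic_rnn(
        cell, inputs, sequence_length=seq_lens, dtype=tf.float32,
        swap_memory=True)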
tf.nn.bidirectional_dynamic_rnn
TF ships a ready-made implementation of bidirectional RNNs, so it might be easier to use it instead of rolling your own.
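Importantly for your use case, it returns the outputs at every timestep, not just the final state. A rough sketch (TF 1.x, placeholder sizes) of how it could be wired up:

    import tensorflow as tf  # TensorFlow 1.x API

    batch_size, max_time, num_features, num_units = 32, 60, 64, 128

    inputs = tf.placeholder(tf.float32, [batch_size, max_time, num_features])
    seq_lens = tf.placeholder(tf.int32, [batch_size])

    cell_fw = tf.nn.rnn_cell.LSTMCell(num_units)
    cell_bw = tf.nn.rnn_cell.LSTMCell(num_units)

    # outputs is a (forward, backward) pair, each [batch, max_time, num_units]:
    # the hidden output at every timestep, which is what a per-character tagger needs.
    (out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
        cell_fw, cell_bw, inputs, sequence_length=seq_lens,
        dtype=tf.float32, swap_memory=True)

    # [batch, max_time, 2 * num_units]
    per_step_features = tf.concat([out_fw, out_bw], axis=-1)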
Stateful RNNs
To deal with very long sequences when training unidirectional RNNs, people do something else: they save the final hidden states of every batch and use them as the initial hidden state for the next batch. (For this to work, the next batch has to consist of the continuations of the previous batch's sequences.)
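A minimal sketch of that trick (TF 1.x, unidirectional LSTM; consecutive_chunks is a hypothetical generator that yields contiguous chunks of the same sequences, row for row, batch after batch):

    import numpy as np
    import tensorflow as tf  # TensorFlow 1.x API

    batch_size, chunk_len, num_features, num_units = 32, 60, 64, 128

    inputs = tf.placeholder(tf.float32, [batch_size, chunk_len, num_features])
    # Placeholders for the carried-over LSTM state (cell state c and output h).
    c_in = tf.placeholder(tf.float32, [batch_size, num_units])
    h_in = tf.placeholder(tf.float32, [batch_size, num_units])
    init_state = tf.nn.rnn_cell.LSTMStateTuple(c_in, h_in)

    cell = tf.nn.rnn_cell.LSTMCell(num_units)
    outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=init_state)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # Start from zeros, then carry the final state into the next chunk.
        state_c = np.zeros([batch_size, num_units], np.float32)
        state_h = np.zeros([batch_size, num_units], np.float32)
        for chunk in consecutive_chunks():  # hypothetical: yields [batch, chunk_len, features]
            out, (state_c, state_h) = sess.run(
                [outputs, final_state],
                feed_dict={inputs: chunk, c_in: state_c, h_in: state_h})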
These threads discuss how this can be done in TF:
TensorFlow: Remember LSTM state for next batch (stateful LSTM)
How do I set TensorFlow RNN state when state_is_tuple=True?