I'm facing the following issue. I have a large number of documents that I want to encode using a bidirectional LSTM. Each document has a different number of words, and each word can be thought of as a timestep.
When configuring the bidirectional LSTM we are expected to provide the timeseries length.
When I am training the model, this value will be different for each batch.
Should I set timeseries_size to the largest document length I will allow, so that any document longer than that is simply not encoded?
Example config:
Bidirectional(LSTM(128, return_sequences=True), input_shape=(timeseries_size, encoding_size))
Note that because you are using the Bidirectional wrapper, the outputs of the forward and backward passes are concatenated, so with LSTM(128, return_sequences=True) the last dimension of the output is 128 + 128 = 256, i.e. the output shape is (None, None, 256).
The input of an LSTM is always a 3D array of shape (batch_size, timesteps, features). The output of the LSTM can be a 2D or a 3D array depending on the return_sequences argument.
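For concreteness, here is a minimal shape sketch (the timestep count of 50 and vector size of 300 are placeholders, not values from the question):

from tensorflow.keras.layers import Input, LSTM, Bidirectional

encoding_size = 300                                           # placeholder word-vector size
inp = Input(shape=(50, encoding_size))                        # (batch_size, timesteps, features)
seq = Bidirectional(LSTM(128, return_sequences=True))(inp)    # shape (None, 50, 256): one 256-d vector per timestep
vec = Bidirectional(LSTM(128, return_sequences=False))(inp)   # shape (None, 256): a single vector per document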
This is a well-known problem and it concerns both ordinary and bidirectional RNNs. This discussion on GitHub might help you. In essence, here are the most common options:
A simple solution is to set timeseries_size to the maximum length over the training set and pad the shorter sequences with zeros (a minimal Keras sketch is given after the options below). An obvious downside is wasted memory if the training set happens to contain both very long and very short inputs.
Separate input samples into buckets of different lengths, e.g. a bucket for length <= 16, another bucket for length <= 32, etc. Basically this means training several separate LSTMs for different sets of sentences. This approach (known as bucketing) requires more effort, but it is currently considered the most efficient and is actually used in the state-of-the-art translation engine Tensorflow Neural Machine Translation. A rough sketch of the bucketing idea is also given below.