I'm trying to train a simple 2-layer neural network with PyTorch LSTMs and I'm having trouble interpreting the PyTorch documentation. Specifically, I'm not too sure how to go about shaping my training data.
What I want to do is train my network on a very large dataset through mini-batches, where each batch is, say, 100 elements long. Each data element will have 5 features. The documentation states that the input to the layer should be of shape (seq_len, batch_size, input_size). How should I go about shaping the input?
I've been following this post: https://discuss.pytorch.org/t/understanding-lstm-input/31110/3 and if I'm interpreting it correctly, each mini-batch should be of shape (100, 100, 5). But in this case, what's the difference between seq_len and batch_size? Also, would this mean that the input LSTM layer should have 5 units?
Thank you!
This is an old question, but since it has been viewed 80+ times with no response, let me take a crack at it.
An LSTM network is used to predict a sequence. In NLP, that would be a sequence of words; in economics, a sequence of economic indicators; etc.
The first parameter is the length of those sequences. If your sequence data is made of sentences, then "Tom has a black and ugly cat" is a sequence of length 7 (seq_len), one element for each word, and maybe an 8th to indicate the end of the sentence.
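To make the shapes concrete, here is a minimal sketch using the numbers from the question (5 features per element, mini-batches of 100 elements). The sequence length of 7 and the hidden size of 32 are arbitrary values I'm assuming just for illustration:

```python
import torch
import torch.nn as nn

seq_len, batch_size, input_size = 7, 100, 5   # 5 features per element, as in the question
hidden_size, num_layers = 32, 2               # hidden size is an arbitrary choice here

# A 2-layer LSTM. input_size must match the number of features (5),
# which is what "5 units" on the input side amounts to.
lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers)

# One mini-batch: (seq_len, batch_size, input_size)
x = torch.randn(seq_len, batch_size, input_size)

output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([7, 100, 32]) -> (seq_len, batch_size, hidden_size)
print(h_n.shape)     # torch.Size([2, 100, 32]) -> (num_layers, batch_size, hidden_size)
```

In other words, batch_size is how many independent sequences you process in parallel, while seq_len is how many time steps each of those sequences contains.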
Of course, you might object, "what if my sequences are of varying length?", which is a common situation.
The two most common solutions are:
Option 1: Pad your sequences with empty elements. For instance, if the longest sentence you have has 15 words, then encode the sentence above as "[Tom] [has] [a] [black] [and] [ugly] [cat] [EOS] [] [] [] [] [] [] []", where EOS stands for end of sentence. Suddenly, all your sequences are of length 15, which solves your issue. The model will quickly learn that the [EOS] token is followed by an unlimited sequence of empty tokens [], so this approach will barely tax your network (see the padding sketch after this list).
Option 2: Send mini-batches of equal lengths. For instance, train the network on all sentences with 2 words, then with 3, then with 4. Of course, seq_len will change from one mini-batch to the next, and the size of each mini-batch will depend on how many sequences of length N you have in your data.
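For Option 1, here is a rough sketch of the padding step using torch.nn.utils.rnn.pad_sequence. The token IDs below are made up for illustration, with 0 playing the role of the empty/pad token:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three "sentences" of different lengths, already encoded as integer token IDs
# (the IDs themselves are invented for this example; 0 is reserved for padding).
sentences = [
    torch.tensor([4, 9, 2, 7, 5, 3, 8, 1]),  # 7 words + EOS
    torch.tensor([4, 6, 1]),                 # 2 words + EOS
    torch.tensor([9, 2, 7, 5, 1]),           # 4 words + EOS
]

# pad_sequence pads every sequence (at the end) up to the length of the longest one.
# With the default batch_first=False, the result is (max_seq_len, batch_size),
# which matches the layout an LSTM expects.
padded = pad_sequence(sentences, padding_value=0)
print(padded.shape)  # torch.Size([8, 3])
```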
A best-of-both-worlds approach would be to divide your data into mini-batches of roughly equal size, grouping sequences by approximate length and adding only the necessary padding. For instance, if you mini-batch together sentences of length 6, 7 and 8, then sequences of length 8 will require no padding, whereas sequences of length 6 will require only 2 padding tokens. If you have a large dataset with sequences of widely varying length, that's the best approach.
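Here is a rough sketch of that bucketing idea, assuming the whole dataset fits in memory; the make_bucketed_batches helper is just a name I made up for illustration:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def make_bucketed_batches(sequences, batch_size):
    """Group sequences of similar length so each batch needs minimal padding."""
    # Sort by length so that neighbouring sequences have similar lengths.
    ordered = sorted(sequences, key=len)
    batches = []
    for i in range(0, len(ordered), batch_size):
        chunk = ordered[i:i + batch_size]
        # Pad only up to the longest sequence *within this batch*.
        batches.append(pad_sequence(chunk, padding_value=0))
    return batches

# Six sequences with lengths 6, 7 and 8: each batch is only padded
# up to the longest sequence it actually contains.
data = [torch.randint(1, 100, (n,)) for n in (6, 8, 7, 6, 8, 7)]
for batch in make_bucketed_batches(data, batch_size=3):
    print(batch.shape)  # torch.Size([7, 3]) then torch.Size([8, 3])
```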
Option 1 is the easiest (and laziest) approach, though, and will work great on small datasets.
One last thing... Always pad your data at the end, not at the beginning.
I hope that helps.