I'm following this tutorial: https://cs230-stanford.github.io/pytorch-nlp.html. In it, a neural model is created using nn.Module, with an embedding layer that is initialized as

self.embedding = nn.Embedding(params['vocab_size'], params['embedding_dim'])

vocab_size is the total number of training samples, which is 4000. embedding_dim is 50. The relevant piece of the forward method is below:
def forward(self, s):
    # apply the embedding layer that maps each token to its embedding
    s = self.embedding(s)   # dim: batch_size x batch_max_len x embedding_dim
I get this exception when passing a batch to the model like so

model(train_batch)

train_batch is a numpy array of dimension batch_size x batch_max_len. Each sample is a sentence, and each sentence is padded so that it has the length of the longest sentence in the batch.
File "/Users/liam_adams/Documents/cs512/research_project/custom/model.py", line 34, in forward s = self.embedding(s) # dim: batch_size x batch_max_len x embedding_dim File "/Users/liam_adams/Documents/cs512/venv_research/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in call result = self.forward(*input, **kwargs) File "/Users/liam_adams/Documents/cs512/venv_research/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 117, in forward self.norm_type, self.scale_grad_by_freq, self.sparse) File "/Users/liam_adams/Documents/cs512/venv_research/lib/python3.7/site-packages/torch/nn/functional.py", line 1506, in embedding return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse) RuntimeError: index out of range at ../aten/src/TH/generic/THTensorEvenMoreMath.cpp:193
Is the problem here that the embedding is initialized with different dimensions than those of my batch array? My batch_size will be constant, but batch_max_len will change with every batch. This is how it's done in the tutorial.
Found the answer here: https://discuss.pytorch.org/t/embeddings-index-out-of-range-error/12582
I was converting words to indexes, but I had based the indexes on the total number of words rather than on vocab_size, which covers only a smaller vocabulary of the most frequent words.
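In other words, every index handed to the embedding layer must be strictly less than vocab_size. A minimal sketch of that fix (the names build_word_to_idx, word_to_idx, and UNK are illustrative, not taken from the tutorial): build the index only over the most frequent words and map everything else to a single UNK slot.

from collections import Counter

def build_word_to_idx(sentences, vocab_size):
    # count word frequencies over the tokenized training sentences
    counts = Counter(w for s in sentences for w in s)
    # keep the vocab_size - 1 most frequent words, reserving one slot for UNK
    most_common = [w for w, _ in counts.most_common(vocab_size - 1)]
    word_to_idx = {w: i for i, w in enumerate(most_common)}
    word_to_idx['UNK'] = len(word_to_idx)   # index vocab_size - 1
    return word_to_idx

def sentence_to_indices(sentence, word_to_idx):
    # every returned index is in [0, vocab_size), so nn.Embedding never sees an out-of-range value
    unk = word_to_idx['UNK']
    return [word_to_idx.get(w, unk) for w in sentence]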
You've got some things wrong. Please correct those and re-run your code:

params['vocab_size'] is the total number of unique tokens, so it should be len(vocab) in the tutorial, not the number of training samples.
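As a rough sketch (the toy vocab below is illustrative; in the tutorial, vocab is the word-to-index mapping built from the most frequent words):

import torch.nn as nn

# vocab stands in for the tutorial's word-to-index mapping (toy values here)
vocab = {'the': 0, 'cat': 1, 'sat': 2, 'UNK': 3}
params = {'vocab_size': len(vocab),   # number of unique tokens, not number of training samples
          'embedding_dim': 50}

embedding = nn.Embedding(params['vocab_size'], params['embedding_dim'])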
params['embedding_dim'] can be 50 or 100 or whatever you choose. Most folks use something in the range [50, 1000], both extremes inclusive. Both Word2Vec and GloVe use 300-dimensional embeddings for their words.
self.embedding() accepts an arbitrary batch size, so that doesn't matter. By the way, in the tutorial, comments such as # dim: batch_size x batch_max_len x embedding_dim indicate the shape of the output tensor of that specific operation, not of the inputs.
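A quick way to see that (the numbers below are illustrative, not from the tutorial):

import torch
import torch.nn as nn

batch_size, batch_max_len = 32, 12
embedding = nn.Embedding(4000, 50)

# input indices have shape batch_size x batch_max_len and values in [0, 4000)
s = torch.randint(0, 4000, (batch_size, batch_max_len))
out = embedding(s)
print(out.shape)   # torch.Size([32, 12, 50]) -> batch_size x batch_max_len x embedding_dim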