I'm trying to go seq2seq
with a Transformer model. My input and output are the same shape (torch.Size([499, 128])
where 499 is the sequence length and 128 is the number of features.
My input looks like:
My output looks like:
My training loop is:
for batch in tqdm(dataset):
optimizer.zero_grad()
x, y = batch
x = x.to(DEVICE)
y = y.to(DEVICE)
pred = model(x, torch.zeros(x.size()).to(DEVICE))
loss = loss_fn(pred, y)
loss.backward()
optimizer.step()
My model is:
import math
from typing import final
import torch
import torch.nn as nn
class Reconstructor(nn.Module):
def __init__(self, input_dim, output_dim, dim_embedding, num_layers=4, nhead=8, dim_feedforward=2048, dropout=0.5):
super(Reconstructor, self).__init__()
self.model_type = 'Transformer'
self.src_mask = None
self.pos_encoder = PositionalEncoding(d_model=dim_embedding, dropout=dropout)
self.transformer = nn.Transformer(d_model=dim_embedding, nhead=nhead, dim_feedforward=dim_feedforward, num_encoder_layers=num_layers, num_decoder_layers=num_layers)
self.decoder = nn.Linear(dim_embedding, output_dim)
self.decoder_act_fn = nn.PReLU()
self.init_weights()
def init_weights(self):
initrange = 0.1
nn.init.zeros_(self.decoder.weight)
nn.init.uniform_(self.decoder.weight, -initrange, initrange)
def forward(self, src, tgt):
pe_src = self.pos_encoder(src.permute(1, 0, 2)) # (seq, batch, features)
transformer_output = self.transformer_encoder(pe_src)
decoder_output = self.decoder(transformer_output.permute(1, 0, 2)).squeeze(2)
decoder_output = self.decoder_act_fn(decoder_output)
return decoder_output
My output has a shape of torch.Size([32, 499, 128])
where 32
is batch, 499
is my sequence length and 128
is the number of features. But the output has the same values:
tensor([[[0.0014, 0.0016, 0.0017, ..., 0.0018, 0.0021, 0.0017],
[0.0014, 0.0016, 0.0017, ..., 0.0018, 0.0021, 0.0017],
[0.0014, 0.0016, 0.0017, ..., 0.0018, 0.0021, 0.0017],
...,
[0.0014, 0.0016, 0.0017, ..., 0.0018, 0.0021, 0.0017],
[0.0014, 0.0016, 0.0017, ..., 0.0018, 0.0021, 0.0017],
[0.0014, 0.0016, 0.0017, ..., 0.0018, 0.0021, 0.0017]]],
grad_fn=<PreluBackward>)
What am I doing wrong? Thank you so much for any help.
PyTorch-Transformers (formerly known as pytorch-pretrained-bert ) is a library of state-of-the-art pre-trained models for Natural Language Processing (NLP).
The transformer neural network is a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease. It was first proposed in the paper “Attention Is All You Need” and is now a state-of-the-art technique in the field of NLP.
src_key_padding_mask or key_padding_mask is a matrix that is supposed to mark the padding areas that the layer should not attend to. It is of size batch_size*N.
There are several points to be checked. As you have same output to the different inputs, I suspect that some layer zeros out all it's inputs. So check the outputs of the PositionalEncoding and also Encoder block of the Transformer, to make sure they are not constant. But before that, make sure your inputs differ (try to inject noise, for example).
Additionally, from what I see in the pictures, your input and output are speech signals and was sampled at 22.05kHz (I guess), so it should have ~10k features, but you claim that you have only 128. This is another place to check. Now, the number 499 represent some time slice. Make sure your slices are in reasonable range (20-50 msec, usually 30). If it is the case, then 30ms by 500 is 15 seconds, which is much more you have in your example. And finally you are masking off a third of a second of speech in your input, which is too much I believe.
I think it would be useful to examine Wav2vec and Wav2vec 2.0 papers, which tackle the problem of self supervised training in speech recognition domain using Transformer Encoder with great success.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With